深田俊明 (Toshiaki Fukada)
Acoustic and Pronunciation Modeling
in Automatic Speech Recognition
Abstract: In conventional speech recognition systems, input speech is first pre-processed by speech analysis, and recognition results are then obtained by searching for the most probable hypotheses under a maximum likelihood criterion, using three knowledge resources: acoustic models, language models, and pronunciation models (i.e., the dictionary).
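With O denoting the observed acoustic feature sequence and W a word-sequence hypothesis, this search can be summarized by the standard decision rule

  \hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W),

where the acoustic models, together with the pronunciation dictionary that expands each word into phone sequences, supply P(O | W), and the language model supplies P(W).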
To improve the performance of speech recognition systems, sophisticated speech analysis, acoustic modeling, language modeling, pronunciation modeling, and search are all indispensable. This report presents novel algorithms to
improve acoustic models and pronunciation models. For improved acoustic modeling, three kinds
of approaches can be considered: (1) improvement of current modeling techniques, (2) incorporation of
new acoustic features into acoustic modeling, and (3) development of a new acoustic modeling
paradigm. For these three approaches, this report proposes, respectively, (1) acoustic modeling
using a speaker normalization technique, (2) speech recognition using segment boundary information,
and (3) model parameter estimation for mixture density segment models.
In spontaneous speech recognition, word pronunciations vary more than in read speech, so actual
pronunciation variations have to be incorporated into the pronunciation dictionary. This report
presents a method for automatically generating multiple pronunciation dictionaries based on a
pronunciation neural network that can predict plausible pronunciations from the standard pronunciation
(see the sketch below). Experimental results on spontaneous speech recognition show that the
automatically derived pronunciation dictionaries give higher recognition rates than the conventional dictionary.
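The pronunciation network described above can be pictured with a minimal sketch, assuming a small feedforward network that maps each phone of the standard (base-form) pronunciation, together with a one-phone context window, to a probability distribution over surface phones, including a deletion symbol; the phone inventory, layer sizes, pruning threshold, and the untrained random weights below are illustrative placeholders rather than the configuration actually used in this report.

# Minimal sketch (illustrative assumptions only): a small feedforward
# "pronunciation network" that maps each base-form phone, with a one-phone
# context window, to a distribution over surface phones (including a
# deletion symbol "-"), followed by enumeration of pronunciation variants
# for a multi-pronunciation dictionary entry. The phone set, layer sizes,
# pruning threshold, and random (untrained) weights are placeholders; in
# practice the weights would be trained on aligned canonical/surface phone
# pairs from spontaneous speech transcriptions.
import itertools
import numpy as np

PHONES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "h", "m", "r", "w", "N", "-"]
P2I = {p: i for i, p in enumerate(PHONES)}
DIM = len(PHONES)
CONTEXT = 1                      # phones of context on each side
IN_DIM = DIM * (2 * CONTEXT + 1)
HIDDEN = 32

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(IN_DIM, HIDDEN))
W2 = rng.normal(scale=0.3, size=(HIDDEN, DIM))

def one_hot(phone):
    v = np.zeros(DIM)
    v[P2I[phone]] = 1.0
    return v

def surface_distribution(baseform, pos):
    """Distribution over surface phones for the base-form phone at `pos`."""
    window = [one_hot(baseform[k]) if 0 <= k < len(baseform) else np.zeros(DIM)
              for k in range(pos - CONTEXT, pos + CONTEXT + 1)]
    h = np.tanh(np.concatenate(window) @ W1)
    logits = h @ W2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate_variants(baseform, rel_threshold=0.2, top_k=3):
    """Keep surface pronunciations whose probability (product of per-phone
    probabilities) is within `rel_threshold` of the best-scoring variant."""
    per_pos = []
    for pos in range(len(baseform)):
        dist = surface_distribution(baseform, pos)
        best = np.argsort(dist)[::-1][:top_k]
        per_pos.append([(PHONES[i], float(dist[i])) for i in best])
    scored = []
    for combo in itertools.product(*per_pos):
        prob = float(np.prod([p for _, p in combo]))
        surface = " ".join(ph for ph, _ in combo if ph != "-")
        scored.append((surface, prob))
    best_prob = max(p for _, p in scored)
    kept = {}
    for surface, prob in scored:
        if prob >= rel_threshold * best_prob:
            kept[surface] = max(prob, kept.get(surface, 0.0))
    return sorted(kept.items(), key=lambda kv: -kv[1])

# Example: candidate dictionary entries for one base-form pronunciation.
for surface, prob in generate_variants(["w", "a", "t", "a", "s", "i"])[:5]:
    print(f"{prob:.2e}  {surface}")

Pruning relative to the best-scoring variant keeps each dictionary entry compact while still admitting frequent substitutions and deletions observed in spontaneous speech.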