TR-IT-0271

TR-IT-0271: 1998.09.04

Toshiaki Fukada (深田俊明)

Acoustic and Pronunciation Modeling in Automatic Speech Recognition

Abstract: In conventional speech recognition systems, input speech is first pre-processed by speech analysis, and recognition results are then obtained by searching for the most probable hypotheses under a maximum likelihood criterion using three knowledge sources: acoustic models, language models, and pronunciation models (i.e., the dictionary). Developing sophisticated methods for speech analysis, acoustic modeling, language modeling, pronunciation modeling, and search is indispensable for improving the performance of speech recognition systems. This report presents novel algorithms for improving acoustic models and pronunciation models. For acoustic modeling, three kinds of approaches can be considered: (1) improving the current modeling, (2) incorporating new acoustic features, and (3) developing a new acoustic modeling paradigm. Corresponding to these three approaches, this report proposes (1) acoustic modeling using a speaker normalization technique, (2) speech recognition using segment boundary information, and (3) model parameter estimation for mixture density segment models. In spontaneous speech recognition, word pronunciation varies more than in read speech, so actual pronunciation variations have to be incorporated into the pronunciation dictionary. This report presents a method for automatically generating multiple pronunciation dictionaries based on a pronunciation neural network that can predict plausible pronunciations from the standard pronunciation. Experimental results on spontaneous speech recognition show that the automatically derived pronunciation dictionaries give higher recognition rates than the conventional dictionary.
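
The way the three knowledge sources combine during search can be written with the standard textbook decoding rule. This is a sketch only: the symbols X (feature sequence produced by speech analysis), W (word sequence), and B (pronunciation, i.e. phone, sequence) are notation assumed here for illustration, and the report itself may state its criterion differently.

```latex
% Assumed notation: X = acoustic features, W = word sequence,
% B = pronunciation (phone) sequence listed in the dictionary for W.
\begin{align}
  \hat{W} &= \arg\max_{W} \; p(X \mid W)\, P(W), \\
  p(X \mid W) &= \sum_{B} p(X \mid B)\, P(B \mid W)
\end{align}
% p(X | B): acoustic model
% P(B | W): pronunciation model (dictionary)
% P(W):     language model
```

In this formulation, the acoustic modeling work affects p(X | B), while a dictionary with several pronunciations per word enlarges the set of B over which the sum runs. The sum is often approximated in practice by its largest term.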