TR-A-0061 :1989.8.25

Hitoshi IWAMIDA, Shigeru KATAGIRI, Erik MCDERMOTT and Yoh'ichi TOHKURA

A Hybrid Speech Recognition System Using HMMs with an LVQ-trained Codebook

Abstract: An ongoing area of research concerns the application of connectionist methods to speech recognition. Here, a new speech recognition system is described that uses the neurally inspired Learning Vector Quantization (LVQ) algorithm to train HMM codebooks. Both LVQ and HMMs are stochastic algorithms holding considerable promise for speech recognition. In particular, LVQ is a vector quantizer with very powerful classification ability, as shown by the high phoneme recognition rates obtained in [McDermott & Katagiri; Proc. of ICASSP 89, 9.S3.1, pp. 81-84, 1989]. HMMs, on the other hand, have the advantage that phone models can easily be concatenated to produce models of longer utterances, such as words or sentences. The new algorithm described here combines the advantages of the two. Instead of using a conventional codebook generated by k-means clustering in the HMMs, the new system uses LVQ to adapt the codebook reference vectors so as to minimize the number of errors these reference vectors make when used for nearest-neighbor classification of training vectors. The LVQ codebook can then provide the HMMs with high classification power at the phonemic level. Using a large-vocabulary database of 5240 common Japanese words uttered in isolation by a male speaker, two main comparisons were performed to evaluate the LVQ-HMM hybrid: (1) Comparison of the hybrid algorithm with TDNN [Waibel et al.; Proc. of ICASSP 88, S3.3, pp. 107-110, 1988] and Shift-Tolerant LVQ [McDermott & Katagiri; Proc. of ICASSP 89, 9.S3.1, pp. 81-84, 1989]. Phoneme recognition experiments were performed on the same data as used for TDNN and Shift-Tolerant LVQ, for 7 Japanese phonemic classes. On these tasks, the new algorithm yielded very high performance, comparable to that of TDNN and Shift-Tolerant LVQ. (2) Comparison of the LVQ-HMM hybrid with HMMs using conventional k-means generated codebooks. Phoneme recognition experiments were performed using phoneme tokens for 25 phoneme categories. In these experiments, the LVQ-HMM hybrid achieved recognition error rates 2 to 3 times lower than those of HMMs using codebooks designed by k-means clustering. These results demonstrate that by using LVQ instead of k-means to generate HMM codebooks, the high discriminant ability of LVQ can be integrated into an HMM architecture that is easily extended to longer utterance models, such as word or sentence models.
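
The codebook adaptation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact LVQ variant, learning-rate schedule, codebook size, or feature vectors used, so everything below is an assumption made for illustration: the basic LVQ1 update rule is used, the function names (`nearest`, `lvq1_train`, `quantize`) are hypothetical, and the two-dimensional toy data stand in for real speech frames. Each labeled reference vector is moved toward a training vector it classifies correctly and away from one it misclassifies; the trained codebook then maps frames to the discrete symbols that a discrete HMM would observe.

```python
import random

def nearest(codebook, x):
    """Index of the reference vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))

def lvq1_train(codebook, labels, data, targets, lr=0.1, epochs=20, seed=0):
    """LVQ1 (assumed variant): attract the winning reference vector if its
    label matches the training vector's class, repel it otherwise."""
    rng = random.Random(seed)
    codebook = [list(c) for c in codebook]   # work on a copy
    order = list(range(len(data)))
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)  # linearly decaying learning rate
        rng.shuffle(order)
        for n in order:
            x, y = data[n], targets[n]
            w = nearest(codebook, x)
            sign = 1.0 if labels[w] == y else -1.0
            codebook[w] = [c + sign * alpha * (v - c)
                           for c, v in zip(codebook[w], x)]
    return codebook

def count_errors(codebook, labels, data, targets):
    """Nearest-neighbor classification errors made by the codebook."""
    return sum(labels[nearest(codebook, x)] != y for x, y in zip(data, targets))

def quantize(codebook, frames):
    """Map each frame to a codebook index (the HMM's discrete observation)."""
    return [nearest(codebook, f) for f in frames]

# Toy data (hypothetical): two well-separated classes standing in for phonemes.
rng = random.Random(1)
data = ([[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(50)] +
        [[rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)] for _ in range(50)])
targets = ["a"] * 50 + ["b"] * 50

# Two reference vectors per class, deliberately initialized off-center.
labels = ["a", "a", "b", "b"]
initial = [[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [3.5, 4.5]]

trained = lvq1_train(initial, labels, data, targets)
errors = count_errors(trained, labels, data, targets)
symbols = quantize(trained, data)
print("errors after training:", errors)
print("first symbols:", symbols[:5], "last symbols:", symbols[-5:])
```

In contrast to k-means, which places reference vectors using distortion alone, this update uses the class labels, which is the property the abstract credits for the hybrid's classification power. In a full system the `symbols` sequence would be passed to discrete HMM training; that stage is omitted here.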