Hitoshi Iwamida, Shigeru Katagiri, E. McDermott, Yoh'ichi Tohkura
A Hybrid Speech Recognition System
Using LVQ Codebooks and HMMs
Abstract: An ongoing area of research concerns the application of connectionist methods
to speech recognition. Here, a new speech recognition system using the neurally-inspired
Learning Vector Quantization (LVQ) to train HMM codebooks is described.
Both LVQ and HMMs are stochastic algorithms holding considerable promise for
speech recognition. In particular, LVQ is a vector quantizer with very powerful classification
ability, as shown by the high phoneme recognition rates obtained in [McDermott &
Katagiri; Proc. of ICASSP 89, 9.S3.1, pp. 81-84, 1989]. HMMs, on the other hand,
have the advantage that phone models can easily be concatenated to produce long utterance
models, such as word or sentence models. The new algorithm described here
combines the advantages inherent in each of these two algorithms. Instead of using a
conventional k-means-generated codebook in the HMMs, the new system uses LVQ
to adapt the codebook reference vectors so as to minimize the number of errors these
reference vectors make when used for nearest neighbor classification of training vectors.
The LVQ codebook can then provide the HMMs with high classification power
at the phonemic level.
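
The codebook adaptation described above can be illustrated with a minimal sketch. The abstract does not state which LVQ variant or training schedule was used, so the basic LVQ1 update rule, the linearly decaying learning rate, and all names here (`nearest`, `lvq1_train`, `error_count`, the toy 2-D data) are assumptions made for illustration only, not the authors' implementation. Each training vector pulls its nearest reference vector closer when their class labels agree and pushes it away when they disagree, which tends to reduce nearest neighbor classification errors on the training set.

```python
import random


def nearest(codebook, x):
    """Index of the reference vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i][0], x)),
    )


def lvq1_train(codebook, data, alpha=0.1, epochs=20, seed=0):
    """Adapt labeled reference vectors with the LVQ1 rule (assumed variant).

    codebook and data are lists of (vector, label) pairs.
    Returns a new codebook; the input codebook is left unchanged.
    """
    rng = random.Random(seed)
    book = [(list(vec), label) for vec, label in codebook]
    samples = list(data)
    for epoch in range(epochs):
        rate = alpha * (1 - epoch / epochs)  # linearly decaying step size (assumption)
        rng.shuffle(samples)
        for x, label in samples:
            vec, ref_label = book[nearest(book, x)]
            # Attract the winner if the labels agree, repel it otherwise.
            sign = 1.0 if ref_label == label else -1.0
            for d in range(len(vec)):
                vec[d] += sign * rate * (x[d] - vec[d])
    return book


def error_count(codebook, data):
    """Number of training vectors whose nearest reference vector has the wrong label."""
    return sum(codebook[nearest(codebook, x)][1] != label for x, label in data)


if __name__ == "__main__":
    # Hypothetical toy data: two classes in 2-D and a poorly placed initial codebook.
    toy_data = [
        ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.3, 0.3), "a"),
        ((1.0, 1.0), "b"), ((0.9, 0.8), "b"), ((0.8, 0.9), "b"), ((0.7, 0.7), "b"),
    ]
    initial = [((0.8, 0.8), "a"), ((1.0, 1.0), "b")]
    trained = lvq1_train(initial, toy_data)
    print("errors before:", error_count(initial, toy_data))
    print("errors after: ", error_count(trained, toy_data))
```

In a hybrid of the kind described here, the trained reference vectors would then serve as the codebook that maps each input frame to a discrete symbol for the HMMs, in place of a codebook produced by k-means clustering.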
Using a large vocabulary database of 5240 common Japanese words uttered in
isolation by a male speaker, two main comparisons were performed to evaluate the
LVQ-HMM hybrid:
(1) Comparison of the hybrid algorithm with TDNN [Waibel et al.; Proc. of
ICASSP 88, S3.3, pp. 107-110, 1988] and Shift-Tolerant LVQ [McDermott & Katagiri;
Proc. of ICASSP 89, 9.S3.1, pp. 81-84, 1989]. Phoneme recognition experiments were
performed using the same data as used in TDNN and Shift-Tolerant LVQ, for 7 Japanese
phonemic classes. For these tasks, the new algorithm yielded very high performance,
comparable to that of TDNN and Shift-Tolerant LVQ.
(2) Comparison of the LVQ-HMM hybrid with HMMs using conventional
k-means-generated codebooks. Phoneme recognition experiments were performed using
phoneme tokens for 25 phoneme categories. In these experiments, the LVQ-HMM
hybrid achieved recognition error rates two to three times lower than those of HMMs using
codebooks designed by k-means clustering.
These results demonstrate that by using LVQ instead of k-means to generate
HMM codebooks, the high discriminant ability of LVQ can be integrated into an
HMM architecture easily extendible to longer utterance models, such as word or sentence
models.