Discriminative Training for High-Accuracy Speech Recognition




Erik McDermott and Shigeru Katagiri, Auditory Research Laboratory, ATR Auditory and Visual Perception Research Laboratories



 Since ATR's founding, we have focused on the importance of research into neural networks and parallel processing, and in particular have studied discriminative training methods based on artificial neural networks for speech recognition. The discriminative training method for prototypes (LVQ) introduced in this article has played a central role in that research, as a technique for accurately and efficiently representing phonemes, which are key units in speech recognition and perception. Simple, fast, and highly discriminative: LVQ combines these attractive properties, and our research has gone on to reveal new possibilities for its application and new theoretical extensions.
 The first author, McDermott, began his research career at the newly founded ATR after graduating from Stanford University, and has now spent three years at ATR and in Japan. ATR's guiding principle has been to actively welcome researchers from abroad, in the hope of creating new science and technology at the meeting point of different cultures, and I am convinced that he is one of those who have contributed most to realizing that ideal. (Katagiri)

1. Introduction
 We are pattern-finding beings. Our perceptual system allows us to find form and regularity amid the profusion of stimuli that constantly assail our senses. The nature of pattern and form is an ancient and fundamental area of human inquiry, and remains largely mysterious to us.
 Pattern recognition includes the kinds of perceptual tasks that humans perform effortlessly, such as the perception of objects and handwriting, or the perception of speech. Though effortless for humans, these tasks are hardly trivial. The analysis of exactly how patterns are recognized has proved extremely difficult and frustrating. Particularly mysterious is the manner in which the same pattern may be recognized in any number of different instantiations. We are able to recognize the same word spoken by many different speakers whose voices are quite different from each other. Even for the same speaker, different utterances of the same word may differ greatly, and yet only rarely is this a problem for human communication.
 One major focus of research in the ATR Auditory and Visual Perception Research Laboratory is thus to address the question of how biological organisms process and recognize patterns, and to investigate whether equally powerful perceptual mechanisms could be implemented on computers.
 We here describe an extremely simple computational algorithm for pattern recognition. This method, applied to a number of tasks in the recognition of speech patterns, has yielded some of the best results yet obtained for these tasks. The simplicity, ease of implementation on parallel computers, and power of the algorithm make it an attractive candidate for future work in speech recognition.

2. Memory Based Pattern Recognition
 The field of pattern recognition has existed for quite some time. A great number of mathematical and statistical techniques have been investigated by different researchers. A good compendium is found in Duda and Hart's 1973 book, "Pattern Classification and Scene Analysis." One classical approach to pattern recognition is known as the "memory-based" approach. This approach attempts to recognize patterns by comparing new patterns with patterns that were previously seen and memorized. The new pattern is then identified as being of the same type as the previously seen pattern that most closely resembles it. This could mean memorizing every single pattern that one sees, and comparing new patterns with a very large number of memorized patterns. Though computationally intensive due to the large number of memorized patterns, this is a very effective method which has been subject to extensive mathematical analysis [5].
 The method we adopt here, based on the work of Kohonen, is strongly related to memory-based methods. The difference is that instead of memorizing every pattern one sees, only a few representative patterns, or prototypes, are memorized. One could hypothesize that when humans recognize a new object, a car for example, they do so by matching the new visual stimuli to a number of prototypic car representations stored in their brains. Given the tremendous capacity of human memory, it would not be surprising if the human perceptual system used a scheme along these lines. One might relate this to Plato's notion of "Ideals", which are prototypic entities that are free from the tremendous variety of specific instantiations of these ideals. In other words, while all cars differ from one another in terms of their exact size, weight, shape, etc., they all have a common quality of "carness," represented in the prototypes. The question is then how to design the prototypes so that they capture this common, invariant quality. The method we will describe here is one answer to this question.

3. Prototype Based Classification and Learning
 We should note that Kohonen[1] provides an account of how this kind of prototype-based scheme might relate to actual neurobiology. Kohonen suggests that patterns might be stored in collections of neurons, each of which represents some feature of a whole pattern. New patterns could be compared to stored patterns by means of numerous inter-neural connections designed in such a way as to activate patterns that are similar to the new pattern, and de-activate patterns that are different from the new pattern. Kohonen provides detailed mathematical descriptions of the workings of such a network. The following discussion is much less of an attempt to describe actual biology, although a neurobiological interpretation could be given.
 We will here make more concrete the ideas described above concerning memory-based pattern recognition. A pattern is represented by a feature vector; i.e. a list of fixed length containing values along different feature dimensions. Both input patterns and prototype patterns are of this format. For a given classification problem, each category is represented by a number of such prototype vectors. An input pattern is then classified by measuring the distance between the associated feature vector and each prototype vector, and identifying the closest prototype. The category of the closest prototype vector will be given as the classification of the unknown vector.
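The nearest-prototype classification step just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the function names are invented here, and squared Euclidean distance is assumed as the distance measure.

```python
import numpy as np

def classify(x, prototypes, labels):
    """Return the category of the prototype closest to input vector x.

    x:          (D,) input feature vector.
    prototypes: (P, D) array of prototype feature vectors.
    labels:     length-P list of category labels, one per prototype.
    """
    # Squared Euclidean distance from x to every prototype at once.
    dists = np.sum((prototypes - x) ** 2, axis=1)
    return labels[int(np.argmin(dists))]

protos = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = ["circle", "square"]
print(classify(np.array([0.1, 0.2]), protos, labels))  # -> circle
```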
 A key question is, How do we generate useful prototype vectors? Kohonen proposes a simple method to do so. The prototype vectors are first initialized using any of a number of methods. Then there is a training phase, which consists of the following simple procedure. A set of training patterns is presented to the system, which will attempt to identify each of these patterns. If a mistake occurs, the prototype vectors are adapted. This involves pushing the closest but incorrect prototype away, and pulling the correct prototype closer. By limiting oneself to small displacements of this kind, the prototype vectors will eventually settle into a stable arrangement that optimally identifies the training patterns.
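The push/pull adaptation step above can be sketched as follows. This is a simplified LVQ-style update under stated assumptions: the learning rate `alpha`, the in-place update, and the restriction of the "pull" to the closest correct prototype are illustrative choices, not details taken from Kohonen's exact formulation.

```python
import numpy as np

def lvq_step(x, label, prototypes, labels, alpha=0.05):
    """One discriminative training step. If the closest prototype
    misclassifies x, push it away from x and pull the closest
    prototype of the correct class toward x. Updates in place."""
    dists = np.sum((prototypes - x) ** 2, axis=1)
    nearest = int(np.argmin(dists))
    if labels[nearest] != label:
        # Push the closest (incorrect) prototype away from the input...
        prototypes[nearest] -= alpha * (x - prototypes[nearest])
        # ...and pull the closest correct-class prototype toward it.
        correct = [i for i, l in enumerate(labels) if l == label]
        best = min(correct, key=lambda i: dists[i])
        prototypes[best] += alpha * (x - prototypes[best])
    return prototypes

protos = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = ["a", "b"]
lvq_step(np.array([0.9, 0.0]), "a", protos, labels)
```

Because each step makes only a small displacement, repeated presentation of the training set lets the prototypes settle into a stable arrangement, as described above.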
 Kohonen refers to this scheme as Learning Vector Quantization (LVQ). Vector Quantization refers to a conventional method in signal processing by which a large number of examples is reduced, or quantized, using a small number of prototypic vectors. To emphasize that the scheme above is a method for discriminating among different classes, we here refer to it as Discriminative Training.
 Figure 1 illustrates the outcome of Discriminative Training. In this figure, the task is to design prototypes that will discriminate between two categories, "circles" and "squares." Given examples for the two categories, the learning algorithm will attempt to generate prototypes that capture not only the similarities within each category, but also the differences between the categories. The prototypes are thus designed with respect to each other, and not independently. The examples of circles participate in the creation of prototypes of squares, and vice-versa. Note that the prototypes do not match any of the examples exactly. Prototypes are abstractions of actually seen examples, and may never be perfectly instantiated.

4. Prototype Based Shift-Tolerant Architecture for Phoneme Recognition
 Having described the basic functioning and motivations for the discriminative training method, we here turn to the application of this method to the problem of phoneme recognition.
 In speech recognition, one key question is how to handle the dynamic nature of the speech signal. Time is an integral component of speech and language. Utterances vary in duration and speed; any recognition algorithm has to account for this somehow. Furthermore, one needs a way of recognizing utterances even if one is not sure exactly when they begin and end in the speech signal.
 Figure 2 illustrates the phoneme recognition architecture we designed in light of these considerations [4]. Phoneme utterances are represented as a spectrogram of 150 msec (15 frames of 10 msec each), the assumption being that 150 milliseconds is long enough to cover most phoneme utterances. During recognition, a 7-frame temporal window is shifted over the utterance. Each position of the window yields a pattern vector, which is then compared to all the prototype vectors. The distance between the pattern vector and the closest prototype vector is used to calculate an activation for each phoneme category. When the window has been fully shifted over the utterance, the category activations are summed to yield a final activation for each category. The utterance is identified as the category with the greatest activation.
 The idea here is that by using prototype vectors in this segmental fashion, one can recognize phoneme utterances regardless of their actual duration (which may be less than 150 msec), and regardless of their exact beginning and end within the 150 msec token. As long as the phoneme occurs somewhere in the 15 frames, prototype vectors corresponding to segments of that phoneme will be more activated than prototypes for different phonemes.
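The shift-tolerant scoring procedure can be sketched as below. This is only a schematic of the architecture: the mapping from distance to activation used in [4] is not reproduced here, so the negated distance to the closest within-category prototype stands in as an assumed activation, and all names are illustrative.

```python
import numpy as np

def shift_tolerant_score(token, prototypes, labels, win=7):
    """Score a phoneme token (frames x features) by sliding a win-frame
    window over it and accumulating per-category activations.

    At each window position the frames are flattened into one segment
    vector and matched against every prototype; each category's
    activation grows by the negated distance to its closest prototype.
    The token is labeled with the highest-scoring category."""
    categories = sorted(set(labels))
    activation = {c: 0.0 for c in categories}
    for t in range(token.shape[0] - win + 1):
        seg = token[t:t + win].ravel()  # one segment vector
        dists = np.sum((prototypes - seg) ** 2, axis=1)
        for c in categories:
            idx = [i for i, l in enumerate(labels) if l == c]
            activation[c] -= min(dists[i] for i in idx)
    return max(activation, key=activation.get)

# Toy token: 15 frames of 2 features; prototypes span 7 x 2 = 14 values.
token = np.zeros((15, 2))
protos = np.array([np.zeros(14), np.ones(14)])
print(shift_tolerant_score(token, protos, ["a", "b"]))  # -> a
```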
 During the training phase, then, the same method as described in Section 3 above was applied for each segment obtained at every shift of the window over the token. The hope here is that training in this segmental fashion will allow the prototype vectors to learn information concerning each phoneme that will help separate the different phonemes from each other.
 The architecture shown in Figure 2 was designed to recognize 18 Japanese consonants. Tokens for these 18 consonants were extracted from the speech data for a single speaker. The system was trained on half of the available tokens, and then tested on the remaining half. The test-data performance for this system was 97.1% correctly recognized phoneme tokens.
 In order to put this high recognition performance in context with other methods, we here briefly describe an alternative method for discriminative phoneme recognition, the Time Delay Neural Networks (TDNN) developed at ATR by Waibel et al. [2,3]. This is a multi-layered neural network trained by the Back-propagation algorithm to minimize the number of classification errors. High recognition results were obtained using this algorithm for a number of tasks. On the same all-consonants recognition task, the TDNN performance is 96.7% correct recognition.
 Thus, similarly high recognition scores were obtained for Discriminative Training and TDNN. We should point out, however, that Discriminative Training is significantly simpler than the multi-layered, Back-propagation trained TDNN. This greater simplicity in turn means that high learning speed is more readily obtained for Discriminative Training than for TDNN.

5. The Promise of Discriminative Training for Speech Recognition
 The recognition architecture we present here is the very first step we took in applying the Discriminative Training algorithm to tasks in speech recognition. Since this first step, we at ATR and others, both in Japan and the USA, have used Discriminative Training as a base for a variety of recognition systems.
 We believe that Discriminative Training is endowed with important properties that make it an attractive method on which to base a number of speech recognition strategies.
 First, the method performs well. The performance described here is the highest reported for the all-consonant phoneme recognition task.
 Second, the method is simple and fast. The prototypes that form the core of the system are simple feature vectors; the main calculation that the system requires is that for measuring the distance between the prototypes and the input pattern. The implementation of Discriminative Training is thus very straightforward. Furthermore, the method can easily be parallelized. Given the appropriate parallel hardware, each vector can be represented as a collection of independent elements. The bulk of many vector operations, such as the distance calculation, can be performed by processing all these elements simultaneously, yielding a tremendous speed-up compared to processing the same elements one after the other on a conventional computer.
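The data parallelism described above can be illustrated with NumPy broadcasting, which evaluates every (input, prototype) element operation in one vectorized step rather than element by element. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def all_distances(inputs, prototypes):
    """inputs: (N, D) batch of pattern vectors; prototypes: (P, D).
    Returns an (N, P) array of squared distances, computed for all
    input/prototype pairs simultaneously via broadcasting."""
    diff = inputs[:, None, :] - prototypes[None, :, :]  # (N, P, D)
    return np.sum(diff ** 2, axis=2)

inputs = np.array([[0.0, 0.0]])
protos = np.array([[0.0, 0.0], [3.0, 4.0]])
print(all_distances(inputs, protos))  # -> [[ 0. 25.]]
```

On genuinely parallel hardware the same structure applies: each element of the difference and each partial sum can be assigned to its own processing element.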
 Third, Discriminative Training is well understood from a mathematical point of view. A rigorous analysis of the mathematical properties of the algorithm, explaining connections to statistical theory and describing Discriminative Training as a gradient descent method, can be found in Katagiri et al. [6].
 Fourth, a number of possibilities exist for the application of prototype-based learning to longer utterance recognition, i.e. word or sentence recognition. Two proposals for this have already been implemented, with encouraging results [7,8].

6. Summary
 Discriminative Training designs prototypes that link together different examples within each category, while distinguishing these examples from all other categories. This method is simple and fast, and our results indicate that it is endowed with impressive classification power. We obtained a recognition rate of 97% for Japanese consonants, a result comparable to that for the TDNN developed by Waibel et al.
 Here we have only described the most fundamental aspects of the algorithm, which has, since we first presented this speech recognition architecture, been applied in a number of different ways to a variety of recognition tasks, including word and sentence recognition. The encouraging results obtained for these prototype-based methods suggest that Discriminative Training may be an effective approach to speech recognition.

Acknowledgments
 We would like to acknowledge the significant contribution that Manami Ohta made to this work. Furthermore, we would like to thank Alex Waibel and Patrick Haffner for engaging in friendly scientific competition with us and providing useful advice. We would also like to thank Peter Davis for providing useful suggestions, and Hitoshi Iwamida, Takashi Komori and Andy Duchon, who made useful comments on drafts of this article. Finally, we would like to thank Yoh'ichi Tohkura and Eiji Yodogawa, without whose enthusiastic support this work would not have been possible.



References