Discriminative Training for High-Accuracy Speech Recognition
Erik McDermott and Shigeru Katagiri, Hearing Research Department, ATR Auditory and Visual Perception Research Laboratories
Since the founding of ATR, we have focused on the importance of research into neural networks and parallel processing, and in particular have studied discriminative learning methods for speech recognition using artificial neural networks. The discriminative learning method for prototypes (LVQ) introduced in this article has played a central role in that research, as a technique for accurately and efficiently representing phonemes, the key units of speech recognition and perception. Simplicity, speed, and high discriminative power: LVQ combines these attractive properties, and our research has since revealed new possibilities for its application and new theoretical extensions.
The first author, McDermott, began his research career at the then newly founded ATR after graduating from Stanford University, and has now spent three years at ATR and in Japan. ATR has made it a guiding principle to actively welcome researchers from abroad, so as to create new science and technology at the meeting point of different cultures; I am convinced that he is among those who have contributed most to realizing that principle. (Katagiri)
1. Introduction
We are pattern-finding beings. Our perceptual system allows us to find form and
regularity amid the profusion of stimuli that constantly assail our senses. The
nature of pattern and form is an ancient and fundamental area of human inquiry,
and remains largely mysterious to us.
Pattern recognition includes the kinds of perceptual tasks that humans perform
effortlessly, such as the perception of objects and handwriting, or the perception
of speech. Though effortless for humans, these tasks are hardly trivial. The analysis
of exactly how patterns are recognized has proved extremely difficult and frustrating.
Particularly mysterious is the manner in which the same pattern may be recognized
in any number of different instantiations. We are able to recognize the same word
spoken by many different speakers whose voices are quite different from each other.
Even for the same speaker, different utterances of the same word may differ greatly,
and yet only rarely is this a problem for human communication.
One major focus of research in the ATR Auditory and Visual Perception Research
Laboratory is thus to address the question of how biological organisms process
and recognize patterns, and to investigate whether equally powerful perceptual
mechanisms could be implemented on computers.
We here describe an extremely simple computational algorithm for pattern recognition.
This method, applied to a number of tasks in the recognition of speech patterns,
has yielded some of the best results yet obtained for these tasks. The simplicity,
ease of implementation on parallel computers, and power of the algorithm, make
it an attractive candidate for future work in speech recognition.
2. Memory Based Pattern Recognition
The field of pattern recognition has existed for quite some time. A great number
of mathematical and statistical techniques have been investigated by different
researchers. A good compendium is found in Duda and Hart's 1973 book, “Pattern
Classification and Scene Analysis.” One classical approach to pattern recognition
is known as the “memory-based” approach. This approach attempts to recognize patterns
by comparing new patterns with patterns that were previously seen and memorized.
The new pattern is then identified as being of the same type as the previously
seen pattern that most closely resembles it. This could mean memorizing every
single pattern that one sees, and comparing new patterns with a very large number
of memorized patterns. Though computationally intensive due to the large number
of memorized patterns, this is a very effective method which has been subject
to extensive mathematical analysis [5].
The method we adopt here, based on the work of Kohonen, is strongly related to
memory-based methods. The difference is that instead of memorizing every pattern
one sees, only a few representative patterns, or prototypes, are memorized. One
could hypothesize that when humans recognize a new object, a car for example,
they do so by matching the new visual stimuli to a number of prototypic car representations
stored in their brains. Given the tremendous capacity of human memory, it would
not be surprising if human perception used a scheme along these lines.
One might relate this to Plato's notion of “Ideals”, which are prototypic entities
that are free from the tremendous variety of specific instantiations of these
ideals. In other words, while all cars differ from one another in terms of their
exact size, weight, shape, etc., they all have a common quality of “carness,”
represented in the prototypes. The question will then be how to design the prototypes
so that they capture this common, invariant quality. The method we will describe
here is one answer to this question.
3. Prototype Based Classification and Learning
We should note that Kohonen [1]
provides an account of how this kind of prototype-based scheme might relate to
actual neurobiology. Kohonen suggests that patterns might be stored in collections
of neurons, each of which represents some feature of a whole pattern. New patterns
could be compared to stored patterns by means of numerous inter-neural connections
designed in such a way as to activate patterns that are similar to the new pattern,
and de-activate patterns that are different from the new pattern. Kohonen provides
detailed mathematical descriptions of the workings of such a network. The following
discussion is much less of an attempt to describe actual biology, although a neurobiological
interpretation could be given.
We will here make more concrete the ideas described above concerning memory-based
pattern recognition. A pattern is represented by a feature vector; i.e. a list
of fixed length containing values along different feature dimensions. Both input
patterns and prototype patterns are of this format. For a given classification
problem, each category is represented by a number of such prototype vectors. An
input pattern is then classified by measuring the distance between the associated
feature vector and each prototype vector, and identifying the closest prototype.
The category of the closest prototype vector will be given as the classification
of the unknown vector.
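As a concrete sketch, this nearest-prototype classification step can be written in a few lines of Python; the feature dimensions and prototype values below are hypothetical, purely for illustration:

```python
import numpy as np

def classify(x, protos, proto_labels):
    """Return the category of the prototype vector closest to input x."""
    dists = np.linalg.norm(protos - x, axis=1)  # distance to every prototype
    return proto_labels[int(np.argmin(dists))]

# Toy setup: two prototype vectors per category (values are illustrative)
protos = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.8, 1.1]])
proto_labels = ["circle", "circle", "square", "square"]
print(classify(np.array([0.1, 0.1]), protos, proto_labels))  # circle
```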
A key question is: how do we generate useful prototype vectors? Kohonen proposes
a simple method to do so. The prototype vectors are first initialized using any
of a number of methods. Then there is a training phase, which consists of the
following simple procedure. A set of training patterns is presented to the system,
which will attempt to identify each of these patterns. If a mistake occurs, the
prototype vectors are adapted. This involves pushing the closest but incorrect
prototype away, and pulling the correct prototype closer. By limiting oneself
to small displacements of this kind, the prototype vectors will eventually settle
into a stable arrangement that optimally identifies the training patterns.
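One such adaptation step might be sketched as follows, following the description above (adapt only when a mistake occurs, push the closest incorrect prototype away, pull the closest correct one in); the learning rate and the exact form of the update are assumptions, not the precise formulation:

```python
import numpy as np

def lvq_step(x, label, protos, proto_labels, lr=0.05):
    """One error-driven update of the prototype vectors (in place)."""
    k = int(np.argmin(np.linalg.norm(protos - x, axis=1)))
    if proto_labels[k] == label:
        return  # correctly classified: no adaptation in this simple variant
    # push the closest (but incorrect) prototype away from x
    protos[k] -= lr * (x - protos[k])
    # pull the closest prototype of the correct category toward x
    same = [i for i, l in enumerate(proto_labels) if l == label]
    j = same[int(np.argmin(np.linalg.norm(protos[same] - x, axis=1)))]
    protos[j] += lr * (x - protos[j])
```

Repeating such small displacements over the training set is what lets the prototypes settle into their final arrangement.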
Kohonen refers to this scheme as Learning Vector Quantization (LVQ). Vector Quantization
refers to a conventional method in signal processing by which a large number of
examples is reduced, or quantized, using a small number of prototypic vectors.
To lay more emphasis on the fact that the above scheme is a method to discriminate
among different classes, we here refer to it as Discriminative Training.
Figure 1 illustrates
the outcome of Discriminative Training. In this figure, the task is to design
prototypes that will discriminate between two categories, “circles” and “squares.”
Given examples for the two categories, the learning algorithm will attempt to
generate prototypes that capture not only the similarities within each category,
but also the differences between the categories. The prototypes are thus designed
with respect to each other, and not independently. The examples of circles
participate in the creation of prototypes of squares, and vice-versa. Note that
the prototypes do not match any of the examples exactly. Prototypes are
abstractions of actually seen examples, and may never be perfectly instantiated.
4. Prototype Based Shift-Tolerant Architecture for Phoneme Recognition
Having described the basic functioning and motivations for the discriminative
training method, we here turn to the application of this method to the problem
of phoneme recognition.
In speech recognition, one key question is how to handle the dynamic nature of
the speech signal. Time is an integral component of speech and language. Utterances
vary in duration and speed; any recognition algorithm has to account for this
somehow. Furthermore, one needs a way of recognizing utterances even if one is
not sure exactly when they begin and end in the speech signal.
Figure 2 illustrates
the phoneme recognition architecture we designed in light of these considerations
[4].
Phoneme utterances are represented as a spectrogram of 150 msec (15 frames of 10 msec
each), the assumption being that 150 milliseconds is long enough to cover most
phoneme utterances. During recognition, a 7-frame temporal window is shifted over
the utterance. Each position of the window yields a pattern vector, which is then
compared to all the prototype vectors. The distance between the pattern vector
and the closest prototype vector is used to calculate an activation for each phoneme
category. When the window has been fully shifted over the utterance, the category
activations are summed to yield a final activation for each category. The utterance
is identified as the category with the greatest activation.
The idea here is that by using prototype vectors in this segmental fashion, one
can recognize phoneme utterances regardless of their actual duration (which may
be less than 150 msec), and regardless of their exact beginning and end within
the 150 msec token. As long as the phoneme occurs somewhere in the 15 frames,
prototype vectors corresponding to segments of that phoneme will be more activated
than prototypes for different phonemes.
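A minimal sketch of this sliding-window recognition is given below, under some assumptions: the exact mapping from distance to activation is not specified above, so the 1/(1 + d) choice here, like the array shapes, is purely illustrative:

```python
import numpy as np

def recognize_token(token, protos, proto_labels, win=7):
    """token: (n_frames, n_coeffs) spectrogram. Slide a win-frame window
    over it; at each shift, add a distance-based activation to every
    category, then return the category with the greatest summed activation."""
    cats = sorted(set(proto_labels))
    acts = dict.fromkeys(cats, 0.0)
    labels_arr = np.array(proto_labels)
    for t in range(token.shape[0] - win + 1):
        seg = token[t:t + win].ravel()            # window -> one pattern vector
        d = np.linalg.norm(protos - seg, axis=1)  # distance to all prototypes
        for c in cats:
            # activation grows as the closest prototype of c gets nearer
            # (the 1/(1 + d) mapping is an assumption)
            acts[c] += 1.0 / (1.0 + d[labels_arr == c].min())
    return max(acts, key=acts.get)
```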
During the training phase, then, the same method as described in Section 3 above
was applied for each segment obtained at every shift of the window over the token.
The hope here is that training in this segmental fashion will allow the prototype
vectors to learn information concerning each phoneme that will help separate the
different phonemes from each other.
The architecture shown in Figure 2 was designed to recognize 18 Japanese consonants.
Tokens for these 18 consonants were extracted from the speech data for a single
speaker. The system was trained on half of the available tokens, and then tested
on the remaining half. The test-data performance for this system was 97.1%
correctly recognized phoneme tokens.
In order to put this high recognition performance in context with other methods,
we here briefly describe an alternative method for discriminative phoneme recognition,
the Time Delay Neural Networks (TDNN) developed at ATR by Waibel et al. [2,3].
This is a multi-layered neural network trained by the Back-propagation algorithm
to minimize the number of classification errors. High recognition results were
obtained using this algorithm for a number of tasks. On the same all-consonants
recognition task, the TDNN achieved 96.7% correct recognition.
Thus, similarly high recognition scores were obtained for Discriminative Training
and TDNN. We should point out, however, that Discriminative Training is significantly
simpler than the multi-layered, Back-propagation-trained TDNN. This greater
simplicity in turn means that high learning speed is more readily obtained for
Discriminative Training than for TDNN.
5. The Promise of Discriminative Training for Speech Recognition
The recognition architecture we present here is the very first step we took in
applying the Discriminative Training algorithm to tasks in speech recognition.
Since this first step, we at ATR and others, both in Japan and the USA, have used
Discriminative Training as a base for a variety of recognition systems.
We believe that Discriminative Training is endowed with important properties
that make it an attractive method on which to base a number of speech recognition
strategies.
First, the method performs well. The performance described here is the highest
reported for the all-consonant phoneme recognition task.
Second, the method is simple and fast. The prototypes that form the core of the
system are simple feature vectors; the main calculation that the system requires
is that for measuring the distance between the prototypes and the input pattern.
The implementation of Discriminative Training is thus very straightforward. Furthermore,
the method can easily be parallelized. Given the appropriate parallel hardware,
each vector can be represented as a collection of independent elements. The bulk
of many vector operations, such as the distance calculation, can be performed
by processing all these elements simultaneously, yielding a tremendous speed-up
compared to processing the same elements one after the other on a conventional
computer.
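On a conventional machine, this element-parallel structure can at least be expressed with array broadcasting, which forms all element-wise differences in a single step rather than in nested loops (a sketch of the idea, not the original implementation):

```python
import numpy as np

def pairwise_distances(X, P):
    """Euclidean distances between every input (rows of X) and every
    prototype (rows of P). Broadcasting forms all element-wise
    differences at once; on parallel hardware each element would map
    to its own processing unit."""
    diff = X[:, None, :] - P[None, :, :]     # shape (n_inputs, n_protos, dim)
    return np.sqrt((diff ** 2).sum(axis=2))  # shape (n_inputs, n_protos)
```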
Third, Discriminative Training is well understood from a mathematical point of
view. A rigorous analysis of the mathematical properties of the algorithm, explaining
connections to statistical theory and describing Discriminative Training as a
gradient descent method, can be found in Katagiri et al. [6].
Fourth, a number of possibilities exist for the application of prototype-based
learning to longer utterance recognition, i.e. word or sentence recognition. Two
proposals for this have already been implemented, with encouraging results [7,8].
6. Summary
Discriminative Training designs prototypes that link together different examples
within each category, while distinguishing these examples from all other categories.
This method is simple and fast, and our results indicate that it is endowed with
impressive classification power. We obtained a recognition rate of 97% for Japanese
consonants, a result comparable to that for the TDNN developed by Waibel et al.
Here we have only described the most fundamental aspects of the algorithm, which
has, since we first presented this speech recognition architecture, been applied
in a number of different ways to a variety of recognition tasks, including word
and sentence recognition. The encouraging results obtained for these prototype-based
methods suggest that Discriminative Training may be an effective approach to speech
recognition.
Acknowledgments
We would like to acknowledge the significant contribution that Manami Ohta made
to this work. Furthermore, we would like to thank Alex Waibel and Patrick Haffner
for engaging in friendly scientific competition with us and providing useful advice.
We would also like to thank Peter Davis for providing useful suggestions, and
Hitoshi Iwamida, Takashi Komori and Andy Duchon, who made useful comments on drafts
of this article. Finally, we would like to thank Yoh'ichi Tohkura and Eiji Yodogawa,
without whose enthusiastic support this work would not have been possible.
References