Erik McDermott and Shigeru Katagiri
Recurrent LVQ for Phoneme Recognition
Abstract: A key issue in speech recognition is the representation of the temporal
structure of the speech signal. In Hidden Markov Models, the sequential
nature of the speech signal is represented explicitly by matching
incoming speech against a connected sequence of states, each of which
models speech at a given temporal position. However, explicit modelling
of this sort requires that one design the state sequences manually and
decide upon the appropriate number and connectivity of the states. It
might be advantageous to learn to represent temporal structure
implicitly instead. Recurrent “neural” networks are a promising method for
achieving this.
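For concreteness, the following is a minimal sketch (in Python with NumPy, a generic textbook formulation rather than the system discussed here) of the explicit temporal matching described above: Viterbi alignment of a frame sequence against a left-to-right chain of HMM states. The scores and topology are illustrative assumptions.

import numpy as np

def viterbi_left_to_right(frame_scores):
    # frame_scores: (T, S) log-scores of each of T frames under each of
    # S states. Topology: start in state 0, then self-loop or advance.
    T, S = frame_scores.shape
    delta = np.full((T, S), -np.inf)   # best path score ending in (t, s)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = frame_scores[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s]
            advance = delta[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s - 1 if advance > stay else s
            delta[t, s] = max(stay, advance) + frame_scores[t, s]
    path = [S - 1]                     # alignment must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1], delta[-1, -1]

# Toy example: 6 frames aligned against a 3-state chain.
rng = np.random.default_rng(0)
path, score = viterbi_left_to_right(rng.normal(size=(6, 3)))
print(path, score)

Note that the number of states and their connectivity (here, self-loop or advance) are fixed by hand; this is precisely the manual design burden mentioned above.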
In previous work we reported high recognition rates for simple
LVQ (Learning Vector Quantization) networks trained to recognize
phoneme tokens shifted in time. To represent acoustic context, these
networks used a fixed-width window that was shifted over the input, on
the assumption that a window of fixed width would suffice to capture
the necessary context. Because the window length is fixed, however,
phonemes that are either longer or shorter than the window are not
optimally represented. Here we examine whether recurrent LVQ
networks can represent context more efficiently.
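As a point of reference, here is a minimal sketch of the fixed-width window scheme (again a generic formulation, not the networks of the previous work): a fixed-width window is slid over the input, each windowed segment is flattened into one vector, and an LVQ1-style nearest-prototype rule classifies it. The window width, feature dimension, and prototypes are illustrative assumptions.

import numpy as np

def lvq_classify(x, prototypes, labels):
    # LVQ1-style decision: assign the label of the nearest prototype.
    distances = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(distances)]

def classify_windows(frames, width, prototypes, labels):
    # Slide a fixed-width window over the (T, D) frame sequence and
    # classify each windowed segment as one flattened vector.
    T, D = frames.shape
    results = []
    for t in range(T - width + 1):
        segment = frames[t:t + width].reshape(-1)
        results.append(lvq_classify(segment, prototypes, labels))
    return results

# Toy example: 10 frames of 4-dimensional features, a 3-frame window,
# and two phoneme classes with random prototypes (illustrative only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))
prototypes = rng.normal(size=(2, 3 * 4))
labels = np.array(["a", "i"])
print(classify_windows(frames, 3, prototypes, labels))

A phoneme longer than the window never fits within a single input vector, while a shorter one is diluted by frames from its neighbours; this is the limitation that motivates the recurrent approach examined here.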