TR-A-0115 : 1991.6.12

Erik McDermott and Shigeru Katagiri

Recurrent LVQ for Phoneme Recognition

Abstract: A key issue in speech recognition is the representation of the temporal structure of the speech signal. In Hidden Markov Models, the sequential nature of the speech signal is explicitly represented by matching incoming speech against a connected sequence of states, each of which models speech at a given temporal position. However, explicit modelling of this sort requires that the state sequences be designed manually, including decisions about the appropriate number and connectivity of the states. It might be advantageous instead to learn to represent temporal structure implicitly. Recurrent “neural” networks are a promising method for achieving this. In previous work we reported high recognition rates for simple LVQ (Learning Vector Quantization) networks trained to recognize phoneme tokens that are shifted in time. To represent the acoustic context, these networks used a fixed-width window that was shifted over the input, under the assumption that a window of that width captures sufficient context. However, because the window length is fixed, phonemes that are longer or shorter than the window are not optimally represented. Here we examine whether recurrent LVQ networks can represent context more efficiently.
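To make the fixed-width window method concrete, the following minimal Python sketch (not taken from the report; all names, dimensions, and the majority-vote decision rule are illustrative assumptions) shifts a fixed-width window over a token's acoustic frames and labels each window position with the nearest LVQ reference vector:

    import numpy as np

    # Illustrative sketch only: classify a phoneme token by shifting a
    # fixed-width window over its frame sequence and assigning each window
    # position the label of the nearest LVQ reference vector.

    def lvq_classify_window(window, references, labels):
        """Return the label of the reference vector closest to the flattened window."""
        distances = np.linalg.norm(references - window.ravel(), axis=1)
        return labels[np.argmin(distances)]

    def classify_token(frames, references, labels, window_width=7):
        """Slide a fixed-width window over the token and combine the decisions.

        frames:     (num_frames, num_features) acoustic feature sequence
        references: (num_refs, window_width * num_features) LVQ codebook
        labels:     (num_refs,) phoneme label of each reference vector
        """
        votes = []
        for start in range(frames.shape[0] - window_width + 1):
            window = frames[start:start + window_width]
            votes.append(lvq_classify_window(window, references, labels))
        # A simple majority vote over window positions; the actual decision
        # rule used in the reported experiments may differ.
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]

    # Toy usage: a random codebook over a 7-frame window of 16 features
    rng = np.random.default_rng(0)
    num_features, window_width = 16, 7
    references = rng.normal(size=(30, window_width * num_features))
    labels = np.array([i % 3 for i in range(30)])   # 3 hypothetical phoneme classes
    token = rng.normal(size=(20, num_features))     # a 20-frame phoneme token
    print(classify_token(token, references, labels, window_width))

The sketch makes the limitation discussed above visible: the codebook dimensionality is tied to window_width, so a token much longer or shorter than the window cannot be matched against a single reference vector without truncation or padding.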