TR-A-0063 :1989.10.23

Roy D. Patterson, Tatsuya Hirahara

HMM Speech Recognition Using DFT and Auditory Spectrograms

Abstract: As the performance of speech recognition systems improves, expectations rise and people contemplate using recognition systems in office environments. Unfortunately, the performance of current recognition systems deteriorates badly when they are required to operate in noise -- even office noise. In an attempt to improve performance in noise, Ghitza (1988) replaced the traditional Fourier frontend of a speech recognition system with an auditory frontend composed of a bank of auditory filters, a bank of hair cells and an Ensemble-Interval Histogram (EIH) used to summarize the information flowing from the bank of hair cells. It is this final stage that provides most of the noise resistance and gives the auditory model its name, EIH. The recognizer is based on a DTW system described by Wilpon and Rabiner (1985), and it was used to compare the performance of the EIH frontend with the traditional FFT frontend. The results show that in noise-free conditions the EIH and FFT systems support essentially the same performance (greater than 90% correct on a word recognition task). However, as the level of the background noise increases, the performance of the FFT system deteriorates more rapidly than that of the EIH system. In the case of male speakers the advantage of the EIH system in noise is dramatic; in the case of female speakers, however, the superiority of the EIH system is marginal.
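The interval-histogram idea can be illustrated with a minimal sketch: each channel of a bandpass filterbank is reduced to the times of its upward level crossings (a crude 'hair cell'), and the intervals between successive crossings are pooled into one histogram across channels. This is a simplification of Ghitza's EIH, which uses multiple crossing levels per channel and his own filter designs; the filter bandwidths, bin count and interval range below are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import butter, lfilter

def eih(signal, fs, center_freqs, n_bins=64, max_interval=0.02):
    """Pooled interval histogram over a bank of bandpass channels.

    Simplified stand-in for an EIH frontend: one zero-crossing level per
    channel instead of Ghitza's multiple levels (an assumption).
    """
    hist = np.zeros(n_bins)
    for fc in center_freqs:
        # Illustrative second-order bandpass filter around each center frequency
        b, a = butter(2, [0.8 * fc, 1.2 * fc], btype='band', fs=fs)
        y = lfilter(b, a, signal)
        # Upward zero crossings stand in for neural firing times
        crossings = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] / fs
        intervals = np.diff(crossings)
        h, _ = np.histogram(intervals, bins=n_bins, range=(0.0, max_interval))
        hist += h
    return hist
```

For a periodic sound the pooled histogram peaks at the waveform period, which suggests why such a representation is comparatively insensitive to added noise: noise-driven crossings spread across many bins while the periodic structure accumulates in a few.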

In this paper we describe a similar attempt to demonstrate the advantage of an auditory frontend for a recognition system that has to operate in noise. Instead of the EIH auditory model, we use an Auditory Sensation Processor (ASP) that simulates the auditory images we experience in response to music and speech sounds. The architecture of ASP is similar to that of EIH, inasmuch as it has three stages -- a filterbank, a 'haircell bank' and a 'neural processor' that removes noise in the time domain using a correlation process -- but ASP has several potential advantages. Firstly, the haircell stage of the ASP model includes lateral suppression, which sharpens features in the output of the filterbank, and so it might be expected to improve performance in noise beyond that provided by the EIH model. Secondly, although the 'neural processor' in ASP has the same function as that in EIH, it uses a data-driven mechanism that is simpler than autocorrelation, and so the ASP frontend is probably faster. Finally, the output of the ASP model is very similar to the traditional spectrogram, so it is easier to read than EIH output and it can be connected directly to recognition systems designed to work with spectrographic input.
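A minimal sketch of the first two ASP stages described above -- a filterbank followed by a rectifying 'hair cell' stage with lateral suppression between neighbouring channels -- might look as follows. The filter shapes, the suppression coefficient and the use of a plain time average in place of ASP's data-driven temporal integration are all assumptions for illustration, not the actual ASP implementation.

```python
import numpy as np
from scipy.signal import butter, lfilter

def asp_spectral_frame(signal, fs, center_freqs, suppress=0.5):
    """One spectrogram-like frame: filterbank -> rectifying 'hair cell'
    with lateral suppression -> time average (parameters are illustrative)."""
    ch = []
    for fc in center_freqs:
        # Illustrative bandpass stand-in for an auditory filter
        b, a = butter(2, [0.8 * fc, 1.2 * fc], btype='band', fs=fs)
        ch.append(np.maximum(lfilter(b, a, signal), 0.0))  # half-wave rectify
    ch = np.asarray(ch)
    # Lateral suppression: each channel is reduced by its neighbours'
    # activity, which sharpens spectral features across channels
    pad = np.pad(ch, ((1, 1), (0, 0)), mode='edge')
    neighbours = 0.5 * (pad[:-2] + pad[2:])
    sharpened = np.maximum(ch - suppress * neighbours, 0.0)
    return sharpened.mean(axis=1)  # one spectral frame
```

Because each call returns an ordinary spectral frame, stacking frames over successive windows yields a spectrogram-like array that a conventional spectrographic recognizer can consume directly, which is the compatibility advantage noted above.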

The speech recognizer was an HMM system described in Waibel, Hanazawa, Hinton, Shikano and Lang (1988). It was designed to take spectrographic input and its performance on a syllable spotting task is well documented. In the present study, the speech stimuli used by Waibel et al (1988) were converted to spectrograms using both the traditional DFT procedure and an auditory model referred to as ASP. In one condition the speech was noise-free; in the other, a loud pink noise was added to the speech sounds. The DFT and ASP systems were trained separately with the clean speech and the noisy speech using half of the syllable database. Then, they were tested on the other half of the database using both clean speech and noisy speech. This procedure enabled us to test the ability of the two recognizer systems to generalize what is learned from one form of the speech (clean or noisy) to the other (noisy or clean) -- a particularly relevant form of generalization for a practical recognizer.
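The design just described amounts to a 2x2 matrix of conditions per frontend: train on clean or noisy speech, test on clean or noisy speech. The protocol can be sketched as follows, where train() and test() are placeholders standing in for the HMM recognizer, which is not reproduced here.

```python
def run_protocol(clean, noisy, train, test):
    """Run the 2x2 train/test design on matched clean/noisy utterance lists.

    The first half of each list is used for training, the second half for
    testing; returns a score per (train condition, test condition) pair.
    """
    half = len(clean) // 2
    results = {}
    for train_cond, train_data in [('clean', clean[:half]),
                                   ('noisy', noisy[:half])]:
        model = train(train_data)
        for test_cond, test_data in [('clean', clean[half:]),
                                     ('noisy', noisy[half:])]:
            results[(train_cond, test_cond)] = test(model, test_data)
    return results
```

The off-diagonal cells (train clean / test noisy and vice versa) are the ones that measure the cross-condition generalization emphasized above.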

The first section of this paper describes ASP and the tuning of the model for use with speech in noise. The second section describes the results of the recognition tests.