In this paper we describe a similar attempt to demonstrate the advantage of an auditory frontend for a recognition system that has to operate in noise. Instead of the EIH auditory model, we use an Auditory Sensation Processor (ASP) that simulates the auditory images we experience in response to music and speech sounds. The architecture of ASP is similar to that of EIH inasmuch as it has three stages -- a filterbank, a 'haircell bank' and a 'neural processor' that removes noise in the time domain using a correlation process -- but ASP has several potential advantages. Firstly, the haircell stage of the ASP model includes lateral suppression, which sharpens features in the output of the filterbank, and so it might be expected to perform better in noise than the EIH model. Secondly, although the 'neural processor' in ASP has the same function as that in EIH, it uses a data-driven mechanism that is simpler than autocorrelation, and so the ASP frontend is probably faster. Finally, the output of the ASP model is very similar to the traditional spectrogram, so it is easier to read than EIH output and can be connected directly to recognition systems designed to work with spectrographic input.
The speech recognizer was an HMM system described in Waibel, Hanazawa, Hinton, Shikano and Lang (1988). It was designed to take spectrographic input, and its performance on a syllable spotting task is well documented. In the present study, the speech stimuli used by Waibel et al. (1988) were converted to spectrograms using both the traditional DFT procedure and the ASP auditory model. In one condition the speech was noise-free; in the other, a loud pink noise was added to the speech sounds. The DFT and SAS systems were each trained separately on the clean speech and on the noisy speech, using half of the syllable database. They were then tested on the other half of the database using both clean speech and noisy speech. This procedure enabled us to test the ability of the two recognizer systems to generalize what is learned from one form of the speech (clean or noisy) to the other (noisy or clean) -- a particularly relevant form of generalization for a practical recognizer.
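The experimental design described above can be laid out as a small grid of conditions. The sketch below is only an illustration of that design, not the code used in the study. The names `FRONTENDS`, `CONDITIONS` and `experiment_grid` are hypothetical, and the label "auditory" simply stands for the auditory-model spectrogram.

```python
from itertools import product

# Hypothetical labels for illustration; they are not taken from the study's software.
FRONTENDS = ("DFT", "auditory")   # type of spectrogram fed to the recognizer
CONDITIONS = ("clean", "noisy")   # speech without or with added pink noise


def experiment_grid():
    """Enumerate every (frontend, training condition, test condition) cell."""
    grid = []
    for frontend, train, test in product(FRONTENDS, CONDITIONS, CONDITIONS):
        grid.append({
            "frontend": frontend,
            "train_on": train,                 # one half of the syllable database
            "test_on": test,                   # the other, held-out half
            "cross_condition": train != test,  # cells that probe generalization
        })
    return grid


if __name__ == "__main__":
    for cell in experiment_grid():
        print(cell)
```

Two frontends, two training conditions and two test conditions give eight cells. The four cells in which the training and test conditions differ are the ones that measure generalization between clean and noisy speech.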
The first section of this paper describes ASP and the tuning of the model for use with speech in noise. The second section describes the results of the recognition tests.