Kazuaki Obara and Tatsuya Hirahara
Auditory front-end in DTW word recognition
under noisy, reverberant and multi-speaker conditions.
Abstract: In this report, three front-ends, a fixed Q cochlear filter (FQF), an adaptive Q
cochlear filter (AQF), and a Bark DFT (DFT), are compared for use as the
front-end of a DTW system. The FQF is a conventional cascade/parallel type
cochlear filter which simulates the asymmetrical filtering characteristics of a
basilar membrane system. The AQF is a nonlinear cochlear filter which simulates
three level-dependent characteristics of a basilar membrane system [T. Hirahara
et al., Proc. ICASSP, 496-499 (1989)]. The DFT front-end generates 64-channel
Bark scale coefficients based on a 512-point DFT magnitude spectrum. These
three front-ends have 64 channels covering the frequency range from 1.5 to 19.5
Bark. Recognition performance is compared for clean speech, for speech degraded by added
noise and/or reverberation, and under multi-speaker conditions.
Four signal-to-noise ratios, S/N=∞ (clean), 40, 20 and 10 dB, were set by adding
different levels of pink noise to the speech data. For reverberant speech, impulse
responses obtained in the ATR reverberation room, with RT=0.2 and 1.1 seconds,
were convolved with the speech data. The speech data used in the experiments were 216
phoneme-balanced Japanese words uttered by 2 male and 2 female speakers. A
standard dynamic time warping (DTW) system was used as a back-end. The
experimental results are as follows: (1) For clean speech, AQF performance is
equal to that of DFT. (2) For noisy speech, AQF performance is equal to that of
FQF and is more robust than that of DFT. (3) For reverberant speech, AQF is
affected more than DFT, but its performance is better than that of FQF. (4) Under
speaker variation, AQF gives better performance than either FQF or DFT. While
the advantage of the AQF front-end is small with an HMM back-end [T. Hirahara
et al., Proc. ICSLP, 381-384 (1990)], these results show that the AQF can be a
better front-end for a DTW recognition system.
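
As an illustration only, the two generic operations named above (mixing noise into speech at a target S/N, and matching a test utterance against word templates by DTW) can be sketched as follows. This is a minimal sketch under stated assumptions, not the system used in the experiments: all function names are hypothetical, the local frame distance is assumed to be Euclidean, the DTW step pattern is assumed to be the basic symmetric one without slope constraints, and any noise sequence may be passed in (the experiments used pink noise, which this sketch does not generate).

```python
import math


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db.

    Hypothetical helper; powers are mean squares over the whole signal.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]


def add_noise(speech, noise, snr_db):
    """Return speech plus noise mixed at the requested S/N in dB."""
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    return [s + n for s, n in zip(speech, scaled)]


def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(ref, test):
    """Accumulated DTW cost between two sequences of feature frames.

    Uses the basic symmetric step pattern (horizontal, vertical, diagonal)
    with no slope constraint or path-length normalization.
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # ref frame advances
                                 cost[i][j - 1],      # test frame advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(templates, test):
    """Return the word whose template has the smallest DTW cost to `test`."""
    return min(templates, key=lambda word: dtw_distance(templates[word], test))
```

In such a sketch, each frame would be one 64-channel front-end output vector (FQF, AQF or Bark DFT), `templates` would map each of the 216 words to a reference frame sequence, and `recognize` would be called once per test utterance.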