Hidefumi Sawai, Alex Waibel, Patrick Haffner,
Masanori Miyatake and Kiyohiro Shikano
Parallelism, Hierarchy, Scaling
in Time-Delay Neural Networks
for Spotting Phonemes and CV-Syllables
Abstract: Syllable or phoneme spotting, if reliably achieved, provides a good solution to the spoken word
and/or continuous speech recognition problem. We previously showed that the Time-Delay Neural
Network (TDNN) provided excellent recognition performance for all phonemic subcategories
(nasals, fricatives, vowels, etc.). To extend this encouraging performance of TDNNs to the
recognition of all phonemes and to word/continuous speech recognition, we present several techniques.
Firstly, we show that it is indeed possible to scale up the TDNN to a large phonemic TDNN aimed
at discriminating all phonemes without loss of recognition performance and without excessive
training tokens. Secondly, we propose fast back-propagation learning methods which make it
possible to train a large phonemic TDNN within 1.5 hours. Finally, we show several methods for
spotting Japanese CV syllables/phonemes in input speech based on TDNNs: we constructed a
TDNN which can discriminate a single CV syllable or phoneme. Syllable and phoneme spotting
experiments show excellent results, including syllable and phoneme spotting rates of better than
96.7% and 92% correct, respectively. These spotting techniques prove to be a good step
toward continuous speech recognition.
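
To make the idea of spotting with a time-delay network concrete, the following is a minimal sketch in plain Python. It is not the architecture or the spotting procedure used in this work. It assumes a single time-delay layer with one output unit, hand-picked (untrained) weights, and a simple peak-picking rule. The names `time_delay_layer` and `spot`, the threshold, and the toy input are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def time_delay_layer(frames, weights, biases):
    """Apply one time-delay layer to a sequence of input frames.

    frames:  list of T frames, each a list of D input values.
    weights: weights[u][d][i] for output unit u, delay d, input coefficient i.
    biases:  one bias per output unit.
    Returns T - N + 1 output frames, where N is the number of delays.
    The same weights are reused at every time shift.
    """
    n_delays = len(weights[0])
    out = []
    for t in range(len(frames) - n_delays + 1):
        frame_out = []
        for u, unit_w in enumerate(weights):
            total = biases[u]
            for d in range(n_delays):
                for i, x in enumerate(frames[t + d]):
                    total += unit_w[d][i] * x
            frame_out.append(sigmoid(total))
        out.append(frame_out)
    return out


def spot(scores, threshold):
    """Return indices of local maxima in scores that reach the threshold.

    Assumed peak-picking rule, chosen only for this illustration.
    """
    hits = []
    for t, s in enumerate(scores):
        left = scores[t - 1] if t > 0 else float("-inf")
        right = scores[t + 1] if t + 1 < len(scores) else float("-inf")
        if s >= threshold and s > left and s >= right:
            hits.append(t)
    return hits


# Toy input: 8 frames with 1 coefficient each; the "event" spans frames 2..4.
frames = [[0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0], [0.0]]

# One output unit looking at 3 consecutive frames (hand-picked weights).
weights = [[[2.0], [2.0], [2.0]]]
biases = [-5.0]

activations = time_delay_layer(frames, weights, biases)
scores = [frame[0] for frame in activations]
hits = spot(scores, threshold=0.5)
print(hits)  # [2]: the window starting at frame 2 covers the whole event
```

Because the weights are shared across time shifts, the unit responds to the pattern wherever it occurs in the input, which is what allows the same network to be slid over continuous speech.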