TR-A-0096

TR-A-0096: 1990.12.14

Seiichi TENPAKU and Tatsuya HIRAHARA

A glottal waveform model for high quality speech synthesis

Abstract: A new glottal waveform model for high quality speech synthesis is proposed, and perceptual evaluations of speech synthesized with the proposed model and with other models are compared. The proposed glottal waveform model consists of two parts: a waveform generator and a spectrum shaping filter. The waveform generator is a third order polynomial whose coefficients are determined by combinations of open quotient (OQ), speed quotient (SQ), amplitude of voicing (AV) and fundamental frequency (F0). The spectrum shaping filter is a second order infinite impulse response (IIR) filter designed to control the spectral tilt and the relative amplitudes of the lower harmonic components using two parameters. Thus, the parameters have a direct effect on the waveform and its spectral shape. Using three kinds of information (F0, power and formants) extracted from 8 different Japanese words produced by two professional announcers (one male and one female), 80 synthesized speech stimuli (8 words × 2 speakers × 5 models) were prepared for preference tests. The stimuli were generated by a cascade formant synthesizer using 5 different glottal waveform models: the proposed model, Fant's model [Fant, Liljencrants & Lin, 1985], Fujisaki's model [Fujisaki & Ljungqvist, 1986], Klatt's model [Klatt, 1980] and Rosenberg's model [Rosenberg, 1971]. In preference tests with 20 subjects, the results for the proposed model are as good as those for the Fant and Fujisaki models.
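The two-part structure described in the abstract (a cubic waveform generator followed by a second order IIR filter) can be illustrated with a minimal sketch. The abstract does not give the report's polynomial coefficients or its mapping from the two filter parameters to filter coefficients, so everything specific below is an assumption made for illustration only: the function names `cubic_pulse` and `biquad`, the particular cubic (zero at both ends of the open phase, peak of height AV placed according to SQ, which restricts this sketch to 0.5 <= SQ <= 2), and the placeholder filter coefficients.

```python
def cubic_pulse(f0, oq, sq, av, fs):
    """One pitch period of a single-cubic glottal pulse.

    Hypothetical shape, NOT the report's formula: the cubic is zero at the
    start and end of the open phase and peaks at height `av`.
    """
    if not 0.0 < oq <= 1.0:
        raise ValueError("oq must be in (0, 1]")
    if not 0.5 <= sq <= 2.0:
        raise ValueError("this single-cubic sketch needs 0.5 <= sq <= 2")
    n_period = int(round(fs / f0))   # samples per pitch period
    n_open = oq * n_period           # open-phase length in samples
    p = sq / (1.0 + sq)              # peak position within the open phase
    # Linear factor chosen so that the derivative vanishes at x = p.
    a = p * (2.0 - 3.0 * p)
    b = 2.0 * p - 1.0
    scale = av / (p * (1.0 - p)) ** 2   # makes the peak value equal to av
    pulse = []
    for n in range(n_period):
        x = n / n_open               # normalized time within the open phase
        if x < 1.0:
            pulse.append(scale * x * (1.0 - x) * (a + b * x))
        else:
            pulse.append(0.0)        # closed phase
    return pulse


def biquad(x, b, a):
    """Generic second order IIR filter:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2].
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


# Example values are arbitrary; the filter has a double pole at 0.8 and
# unity gain at DC (placeholder coefficients, not taken from the report).
pulse = cubic_pulse(f0=100.0, oq=0.5, sq=1.5, av=1.0, fs=16000)
shaped = biquad(pulse, b=(0.04, 0.0, 0.0), a=(-1.6, 0.64))
```

In this sketch OQ and F0 set the pulse duration, SQ sets the skew and AV sets the peak height, while the filter reshapes the spectrum afterwards. This mirrors the separation between waveform parameters and spectral parameters stated in the abstract, without claiming to reproduce the report's actual equations.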