Seiichi TENPAKU and Tatsuya HIRAHARA
A glottal waveform model for high quality speech synthesis
Abstract: A new glottal waveform model for high-quality speech synthesis is
proposed, and perceptual evaluations of speech synthesized with the
proposed model and with other models are compared. The
proposed glottal waveform model consists of two parts: a waveform
generator and a spectrum shaping filter. A third-order polynomial,
whose coefficients are determined by combinations of open quotient
(OQ), speed quotient (SQ), amplitude of voicing (AV) and fundamental
frequency (F0), is used for the waveform generator. A second-order
infinite impulse response (IIR) filter, designed to control the
spectral tilt and the relative amplitudes of lower harmonic components
using two parameters, serves as the spectrum shaping filter. Thus, the
parameters have a direct effect on the waveform and its spectral shape.
Using three kinds of information (F0, power and formants) extracted from
eight different Japanese words produced by two professional announcers
(one male and one female), 80 synthesized speech stimuli were prepared
for preference tests. The stimuli were generated by a cascade formant
synthesizer using five different glottal waveform models: the proposed
model, Fant's model [Fant, Liljencrants & Lin, 1985], Fujisaki's model
[Fujisaki & Ljungqvist, 1986], Klatt's model [Klatt, 1980] and Rosenberg's
model [Rosenberg, 1971]. In preference tests with 20 subjects, speech
synthesized with the proposed model was judged to be as good as that of
the Fant and Fujisaki models.
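
The two-part structure described in the abstract (a polynomial waveform generator followed by a second-order IIR shaping filter) can be illustrated with a minimal sketch. The abstract does not give the polynomial coefficients or the filter design, so everything specific below is an assumption made for illustration only: the pulse is built from two generic cubic segments (opening and closing), the timing uses the standard definitions OQ = open time / period and SQ = opening time / closing time, and the filter coefficients are arbitrary placeholders rather than the two-parameter design of the proposed model. The function names are hypothetical.

```python
def phase_durations(f0, oq, sq):
    """Split one pitch period into opening, closing and closed durations (s).

    Assumes the usual definitions: OQ = open time / period,
    SQ = opening time / closing time (0 < oq <= 1, sq > 0).
    """
    t0 = 1.0 / f0                       # pitch period
    t_open = oq * t0                    # total open-phase duration
    t_rise = t_open * sq / (1.0 + sq)   # opening phase
    t_fall = t_open - t_rise            # closing phase
    return t_rise, t_fall, t0 - t_open


def glottal_pulse(f0, oq, sq, av, fs):
    """One period of glottal flow from two cubic segments.

    ASSUMPTION: generic cubic rise/fall shapes, not the paper's polynomial.
    """
    t_rise, t_fall, _ = phase_durations(f0, oq, sq)
    n = int(round(fs / f0))             # samples per period
    pulse = []
    for i in range(n):
        t = i / fs
        if t < t_rise:                  # opening: cubic rise from 0 to av
            x = t / t_rise
            g = av * (3 * x**2 - 2 * x**3)
        elif t < t_rise + t_fall:       # closing: cubic fall from av to 0
            y = (t - t_rise) / t_fall
            g = av * (1 - 3 * y**2 + 2 * y**3)
        else:                           # closed phase: no flow
            g = 0.0
        pulse.append(g)
    return pulse


def second_order_iir(x, b, a):
    """Direct-form I second-order IIR filter.

    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


# Example: a 100 Hz pulse at 16 kHz, then a placeholder low-pass filter
# standing in for the spectrum shaping stage (coefficients are arbitrary).
pulse = glottal_pulse(f0=100.0, oq=0.6, sq=2.0, av=1.0, fs=16000)
shaped = second_order_iir(pulse, b=(0.1, 0.0, 0.0), a=(-0.9, 0.0))
```

In this sketch OQ and SQ act only on the timing of the pulse and AV only on its peak, while the filter stage alone changes the spectral balance; this mirrors the separation of roles stated in the abstract without reproducing the authors' actual equations.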