Andrew J. Hunt, Alan W. Black
An Investigation of the Quality of
Concatenation of Speech Waveforms
Abstract: Three unit concatenative speech synthesis systems are currently being developed or supported
by ATR Interpreting Telecommunications Research Laboratories: ATR ν-Talk, CHATR and SUBPHONET. An important requirement for the generation of high quality synthetic speech by these
systems is that the joining of units produces smooth output. The research presented here aims to
develop signal processing estimates of the quality of concatenation. A tightly controlled perceptual experiment was carried out to obtain subjective judgements of the quality of isolated words
produced by concatenation of two units (i.e., with only one concatenation point). A range of standard signal processing measures was evaluated for their ability to predict the subjective judgements.
These measures included power, fundamental frequency, cepstrum and MFCC, two compressed
forms of MFCC, and two dynamic variants of MFCC. The experimental results show that MFCC
parameters provide the best basis for predicting concatenation quality, and that a combination
of power and MFCC can significantly improve upon the use of MFCC alone. Moreover, the compressed versions of MFCC have similar predictive accuracy, a result which allows us to trade off predictive
accuracy against computational and storage requirements. In comparison to the improved cepstral
representation used in ATR ν-Talk and SUBPHONET, a vector quantisation representation of
MFCC has slightly better predictive accuracy but requires less than 1% of the space for storage.
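
As an illustration of how power and MFCC might be combined into a single estimate of discontinuity at a concatenation point, a minimal sketch is given below. It is not the measure evaluated in this work: the Euclidean distance, the use of a single frame on each side of the join, the weights `w_mfcc` and `w_power`, and the frame layout (`mfcc`, `log_power`) are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def join_cost(left_frame, right_frame, w_mfcc=1.0, w_power=1.0):
    """Weighted sum of MFCC distance and log-power difference at a join.

    left_frame is the last frame of the first unit, right_frame the first
    frame of the second unit. Both are dicts with hypothetical keys
    "mfcc" (list of floats) and "log_power" (float). The weights are
    placeholders; in practice they would be fitted to perceptual data.
    """
    mfcc_dist = euclidean(left_frame["mfcc"], right_frame["mfcc"])
    power_dist = abs(left_frame["log_power"] - right_frame["log_power"])
    return w_mfcc * mfcc_dist + w_power * power_dist

# Toy frames with made-up values, for illustration only.
left = {"mfcc": [1.0, 2.0, 3.0], "log_power": -2.0}
right = {"mfcc": [1.0, 2.0, 7.0], "log_power": -2.5}
cost = join_cost(left, right)
```

A lower value would be taken to predict a smoother join. Setting `w_power=0.0` reduces the sketch to an MFCC-only distance, which gives a simple way to compare the two configurations on the same pair of units.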