TR-IT-0137 :1995.11

Andrew Hunt, Alan W. Black

An Investigation of the Quality of Concatenation of Speech Waveforms

Abstract: Three unit-concatenative speech synthesis systems are currently being developed or supported by ATR Interpreting Telecommunications Research Laboratories: ATR ν-Talk, CHATR and SUBPHONET. An important requirement for the generation of high-quality synthetic speech by these systems is that the joining of units produces smooth output. The research presented here aims to develop signal processing estimates of the quality of concatenation. A tightly controlled perceptual experiment was carried out to obtain subjective judgements of the quality of isolated words produced by concatenating two units (i.e. with only one concatenation point). A range of standard signal processing measures was evaluated for the ability to predict the subjective judgements. These measures included power, fundamental frequency, cepstrum and MFCC, two compressed forms of MFCC, and two dynamic variants of MFCC. The experimental results show that MFCC parameters provide the best basis for predicting concatenation quality, and that a combination of power and MFCC can significantly improve upon the use of MFCC alone. Moreover, the compressed versions of MFCC have similar predictive accuracy, a result which allows us to trade off predictive accuracy against computational and storage requirements. In comparison to the improved cepstral representation used in ATR ν-Talk and SUBPHONET, a vector quantisation representation of MFCC has slightly better predictive accuracy but requires less than 1% of the space for storage.
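As a rough illustration of the kind of estimate discussed above, the following minimal sketch combines a spectral distance between the MFCC frames on either side of a single concatenation point with the power difference at that point. It is not the measure evaluated in this report: the function name `join_cost`, the use of Euclidean distance, and the weights `w_mfcc` and `w_power` are all assumptions made for illustration only.

```python
import math


def join_cost(left_mfcc, right_mfcc, left_power, right_power,
              w_mfcc=1.0, w_power=1.0):
    """Hypothetical concatenation cost at one join point.

    left_mfcc / right_mfcc: MFCC vectors of the last frame of the first
    unit and the first frame of the second unit.
    left_power / right_power: power (e.g. in dB) of the same two frames.
    w_mfcc / w_power: illustrative weights, not values from the report.
    """
    if len(left_mfcc) != len(right_mfcc):
        raise ValueError("MFCC vectors must have the same length")

    # Assumed spectral mismatch: Euclidean distance between MFCC vectors.
    mfcc_dist = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(left_mfcc, right_mfcc)))

    # Power mismatch: absolute difference across the join.
    power_dist = abs(left_power - right_power)

    # Weighted sum; a lower value suggests a smoother join.
    return w_mfcc * mfcc_dist + w_power * power_dist


if __name__ == "__main__":
    # Toy frames with made-up values, purely to show the call.
    print(join_cost([0.0, 0.0], [3.0, 4.0], 1.0, 1.5))
```

In practice the weights would have to be fitted against perceptual judgements such as those collected in the experiment, and a compressed or vector-quantised MFCC representation could replace the raw vectors to reduce storage.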