TR-IT-0304 :1999

Jing-Dong Chen & Nick Campbell

Objective Distance Measures for Assessing Concatenative Speech Synthesis

Abstract: This report has two parts. In the first part, several acoustic transformations of the speech signal are compared for use in the assessment and evaluation of concatenative speech synthesis. The transformations tested include LPC, LSP, MFCC, residual MFCC (RMFCC), the bispectrum, the Mellin transform of the log spectrum, and the Wigner-Ville distribution, among others. For each transformation, the distance between a synthesized utterance and a naturally spoken version of the same sentence is computed, and these distances are compared by correlation with perceptual scores obtained from a mean opinion score (MOS) evaluation. The results show that distances computed using the bispectrum correlate most highly with the MOS. Both the RMFCC and the LPC outperform the MFCC and the LPCC. The cepstrum based on the Wigner-Ville distribution performs poorly in this task. In the second part, a five-level scoring method based on a technique called sluggish coding is proposed. The experimental results show that, with sluggish coding, the method can convert a distance obtained from dynamic time warping (DTW) into a five-level score that correlates highly with the MOS.
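The evaluation pipeline summarized in the abstract can be illustrated with a minimal sketch: align the feature sequences of a synthesized and a natural utterance with DTW, take the accumulated distance, and correlate the distances for a set of utterances with their MOS values. The function names, the Euclidean frame distance, the symmetric step pattern, and the normalization by the summed sequence lengths are all assumptions made for illustration, and the report's own choices may differ. The sluggish-coding step that maps a distance to a five-level score is not sketched here.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(a, b):
    """Length-normalized DTW distance between two sequences of feature vectors.

    a and b are lists of equal-dimension feature vectors, e.g. one vector
    of cepstral coefficients per analysis frame.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] is the cheapest cost of aligning a[:i] with b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            # Assumed step pattern: insertion, deletion, or match.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Assumed normalization so that long utterances are not penalized.
    return acc[n][m] / (n + m)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)
```

In use, dtw_distance would be called once per utterance pair for a given transformation, and pearson would then be applied to the resulting list of distances and the corresponding list of MOS values. If a distance measure tracks perceived quality, a strongly negative coefficient would be expected, since a larger distance should correspond to a lower MOS.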