Jing-Dong Chen & Nick Campbell
Objective Distance Measures for Assessing
Concatenative Speech Synthesis
Abstract: This report contains two parts. In the first part, several different acoustic transformations
of the speech signal are compared for use in the assessment and evaluation of concatenative speech
synthesis. The transformations tested include LPC, LSP, MFCC, residual MFCC (RMFCC), the bispectrum,
the Mellin transform of the log spectrum, and the Wigner-Ville distribution, among others. The distances
computed between a synthesized utterance and a naturally spoken version of the same sentence are evaluated
by their correlation with perceptual scores obtained from a mean opinion score (MOS) evaluation. The results show
that distances computed using the bispectrum correlate most strongly with the MOS. Both the RMFCC
and the LPC outperform the MFCC and the LPCC. The cepstrum based on the Wigner-Ville distribution
is found to perform poorly in this task.
In the second part, a five-level scoring method based on a technique called sluggish coding is
proposed. The experimental results show that, with sluggish coding, the method can convert a
distance obtained from dynamic time warping (DTW) into a five-level score that correlates highly
with the MOS.
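
The evaluation procedure summarized above (a distance between a synthesized and a natural utterance, then correlation of those distances with MOS) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names dtw_distance and pearson, the Euclidean local cost, the symmetric step pattern, and the normalization by the summed sequence lengths are all assumptions made for illustration. Feature extraction (LPC, MFCC, bispectrum, and so on) and the sluggish-coding step are not shown.

```python
import numpy as np


def dtw_distance(ref, syn):
    """Length-normalized DTW distance between two feature sequences.

    ref, syn: arrays of shape (frames, dims), e.g. one feature vector per frame.
    Assumptions (not from the report): Euclidean local cost, steps of
    (1,0), (0,1), (1,1), and normalization by the total number of frames.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    n, m = len(ref), len(syn)

    # Local cost matrix: distance between every reference/synthesized frame pair.
    local = np.linalg.norm(ref[:, None, :] - syn[None, :, :], axis=2)

    # Accumulated cost with an extra border row/column set to infinity.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev

    return acc[n, m] / (n + m)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    return float(np.corrcoef(x, y)[0, 1])
```

In use, one would compute dtw_distance for each synthesized/natural utterance pair and pass the resulting distances, together with the corresponding MOS values, to pearson. Since a larger distance should indicate lower quality, a useful measure is expected to give a strongly negative coefficient.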