TR-I-0287 :1992.11.5

Paul BAGSHAW

Automated Labelling of Prosodic Aspects of English Speech: Final Report

Abstract: Approaches to automatically transcribing the variations in acoustic parameters related to prosodic events in English speech are investigated. The events of interest are the relative prominence of syllables in continuous speech and the pitch movements associated with accented syllables. A description is given of a database which contains transcriptions of these prosodic events as perceived by a hand labeller (the "MLP dialogue database"). This database is used to evaluate the performance of the automatic transcription algorithms. An existing algorithm, the JLH syllable prominence labelling algorithm, is evaluated to give a benchmark against which other approaches are compared. The JLH algorithm is limited in that it describes syllable prominence using a discrete number of categories and gives no indication of the pitch movements associated with each accented syllable. Furthermore, the approach is rule-based and uses arbitrary thresholds. These limitations, and the use of all but a few of the arbitrary thresholds, are overcome without a reduction in performance by abstracting acoustic parameters related to prosodic events. Phones are grouped into syllables automatically by using a sonorant energy contour. The duration and energy of the phones in these syllables are normalised to compensate for variations attributed to phone type. The fundamental frequency contour is stylised by piece-wise linearisation to remove microprosodic variations. The piece-wise units are related to pitch movements and given relative heights. These abstracted features are mapped to syllable prominence labels using rules similar to those used by the JLH algorithm. Alternative rules are sought by using regression trees to model the hand-transcribed syllable prominence labels given a set of abstracted acoustic features. The tree models do not give a significant increase in the correlation between the automatically transcribed syllable prominences and those transcribed by hand. This is evidence that the rules applied to map the abstracted features to syllable prominence labels are not inappropriate. However, there is an indication that modifications should be made to the rules used to locate accented syllables from the pitch movements.
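
As an illustration of the normalisation step mentioned in the abstract, the following minimal sketch rescales a per-phone measurement (duration or energy) relative to other phones of the same type. The abstract does not state which normalisation is used, so the z-score scheme, the function name `normalise_by_phone_type`, and the phone labels and values in the usage note are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalise_by_phone_type(phones):
    """Return a z-score for each (phone_label, value) pair.

    Each value is compared with the mean and population standard
    deviation of all values sharing the same phone label. This is an
    assumed scheme, not necessarily the one used in the report.
    """
    # Collect the observed values for each phone type.
    values_by_type = defaultdict(list)
    for label, value in phones:
        values_by_type[label].append(value)

    # Per-type mean and population standard deviation.
    stats = {
        label: (mean(values), pstdev(values))
        for label, values in values_by_type.items()
    }

    normalised = []
    for label, value in phones:
        mu, sigma = stats[label]
        # A phone type with no spread carries no relative information.
        normalised.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return normalised
```

For example, given hypothetical durations in milliseconds, `normalise_by_phone_type([("aa", 100), ("aa", 140), ("t", 30), ("t", 50)])` places the long vowel and the long stop on the same scale, so that a 50 ms "t" counts as just as lengthened as a 140 ms "aa".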