Toward Automated Chinese-to-Japanese Interpretation:

Enabling Recognition of Spoken Mandarin Chinese by ATR-MATRIX


Introduction

In March 2000, ATR SLT launched a project to integrate Chinese interpretation into its MATRIX multilingual interpretation system. This initiative was undertaken to address the potentially large market for products providing automatic interpretation between Japanese and Chinese. This paper describes the development of a prototype Chinese speech recognizer serving as the front end of that system.

Chinese differs significantly from Japanese, English and similar languages in its monosyllabic pronunciation system, its phonological use of pitch tones, and its purely ideographic writing system. The application of modern ASR (automatic speech recognition) techniques to Chinese requires special consideration of these characteristics. After investigating sub-word unit selection, acoustic model development, lexicon design and language modeling, we successfully ported SPREC, an HMnet-based ATR recognizer, to a continuous Chinese recognizer with a 10,000-word vocabulary. This system performed well in continuous speech recognition of ATR's hotel reservation task involving Chinese read aloud from a prepared text.


Characteristics of Chinese Speech

Chinese is a monosyllabic tonal language in which each ideographic character is uttered as a monosyllable associated with a tone having a specific pitch pattern. Each word comprises one or more ideographic characters. Figure 1 gives examples of four disyllabic words and their pronunciations represented in Pinyin form. A tonal syllable is configured as a segmental base syllable and a pitch tone. The base syllable is traditionally decomposed into an Initial (shengmu) and a Final (yunmu). The Initial is either a consonant or null, and the Final may be a vowel or a vowel compound. Mandarin speech contains 22 Initials and 38 Finals. A compound Final can be further decomposed into a maximum of three phone-like units: glide, nucleus, and coda, as shown by the decomposition of the base syllable liang in Figure 1. The figure clearly shows that the different tones are the only feature discriminating the four words.
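To make this decomposition concrete, the following Python sketch splits a toned Pinyin syllable into its Initial, Final, and tone, and a compound Final into glide, nucleus, and coda. The unit inventories in the sketch are partial, hand-written illustrations, not the full 22-Initial/38-Final set used by the system.

    # Illustrative sketch only: a toy decomposition of toned Pinyin syllables
    # into Initial / Final / tone, and of the Final into glide, nucleus, coda.
    # The inventories below are partial examples, not the full inventory.

    TOY_INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                    "g", "k", "h", "j", "q", "x", "z", "c", "s", "r"]
    GLIDES = ("i", "u", "v")          # "v" stands in for the umlaut-u
    CODAS = ("ng", "n", "i", "u", "o")

    def split_syllable(toned):
        """Split e.g. 'liang2' into (Initial, Final, tone)."""
        base, tone = toned[:-1], int(toned[-1])
        for ini in sorted(TOY_INITIALS, key=len, reverse=True):
            if base.startswith(ini) and len(base) > len(ini):
                return ini, base[len(ini):], tone
        return "", base, tone          # null Initial

    def split_final(final):
        """Split e.g. 'iang' into (glide, nucleus, coda); parts may be empty."""
        glide = final[0] if len(final) > 1 and final[0] in GLIDES else ""
        rest = final[len(glide):]
        coda = next((c for c in CODAS if rest.endswith(c) and len(rest) > len(c)), "")
        nucleus = rest[:len(rest) - len(coda)]
        return glide, nucleus, coda

    print(split_syllable("liang2"))   # ('l', 'iang', 2)
    print(split_final("iang"))        # ('i', 'a', 'ng')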

This characteristic poses problems for Chinese ASR. Because the pitch tone must be produced and perceived on the Final, the Final segment is typically much longer than the Initial one. Consequently, acoustic matching scores from Initial segments account for only a very small portion of those for whole syllables. Syllables sharing the same Final are then easily confused, since the discrimination scores contributed by their Initials may not be significant. Tone recognition is also difficult to achieve with current ASR systems.

Another important characteristic of Chinese is the absence of clear word boundaries in sentences. A sentence is a sequence of ideographic characters with no spaces or other boundary indications between words. This easily leads to ambiguity in machine processing, since any character may form a word with either the preceding or the succeeding characters. This characteristic, combined with the recognition ambiguities resulting from monosyllabic tonal pronunciations, complicates Chinese ASR. As Figure 2 illustrates, even when there is little confusion among the syllable candidates hypothesized from acoustic matching, the possible word hypothesis graph is rather large and complex, even for a short seven-word phrase.
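The following sketch illustrates how quickly the hypothesis space grows: even for a correctly recognized base-syllable string, dictionary lookup yields many competing word sequences. The lexicon and phrase below are hypothetical toy examples, not entries from our actual 10,000-word lexicon or from Figure 2.

    # A minimal sketch of how an unambiguous syllable string still expands into
    # a word hypothesis graph. The toy lexicon maps syllable concatenations to
    # (possibly several) hypothetical ideographic words.

    TOY_LEXICON = {
        "wo": ["我"],
        "xiang": ["想", "香"],
        "ding": ["订", "定"],
        "fang": ["房", "方"],
        "jian": ["间", "见"],
        "fang-jian": ["房间"],
        "xiang-ding": ["想订"],
    }

    def word_hypotheses(syllables, start=0):
        """Enumerate every way to cover `syllables` with lexicon entries."""
        if start == len(syllables):
            return [[]]
        hyps = []
        for end in range(start + 1, len(syllables) + 1):
            key = "-".join(syllables[start:end])
            for word in TOY_LEXICON.get(key, []):
                hyps += [[word] + tail for tail in word_hypotheses(syllables, end)]
        return hyps

    # Every hypothesis for one five-syllable utterance ("I want to book a room").
    for hyp in word_hypotheses(["wo", "xiang", "ding", "fang", "jian"]):
        print(" ".join(hyp))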


ATR's Prototype Chinese ASR System

The development of ATR's Chinese ASR system began with the acquisition of Chinese speech databases. Since the objective of the system is to play the role of a travel agent, ATR solicited the assistance of the China Academia Sinica in obtaining the CAS00 speech database, which includes 10 hours of hotel reservation dialogues by 100 speakers. Additionally, we purchased two published speech corpora (HKU96 and 863) for the development of baseline acoustic models.

We used the ATRSPREC recognition engine as the basis for developing the system. SPREC, originally developed for Japanese, is known for its HMnet (Hidden Markov network), a highly compact acoustic model, and was subsequently ported successfully to English speech recognition. The first version of the Chinese system does not incorporate direct pitch tone recognition, as the system scheme was designed for non-tonal languages. We therefore refer to this version as a "prototype Chinese ASR system."

Phones are the sub-word units normally used in Japanese, English, and many other ASR systems. For our Chinese ASR system, however, we used the 22 Initials and 38 Finals as the sub-word units. We believe that this choice better models the coarticulation acoustics resulting from the monosyllabic characteristic: a phone within a compound Final exhibits markedly different acoustics from the same phone in isolation. Our speech recognition experiments have also shown that this approach achieves better recognition accuracy than the phone-based choice examined in [2].
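The sketch below contrasts the two sub-word choices for a single dictionary word. The pronunciation tables are hand-written toy mappings for illustration, not entries from the system's lexicon.

    # Contrasting sub-word unit choices for the same word, under assumed
    # toy pronunciation tables for the word "fangjian" (房间, "room").
    PHONE_UNITS = {
        "fang": ["f", "a", "ng"],
        "jian": ["j", "i", "a", "n"],
    }
    INITIAL_FINAL_UNITS = {
        "fang": ["f", "ang"],          # Initial "f" + compound Final "ang"
        "jian": ["j", "ian"],          # Initial "j" + compound Final "ian"
    }

    word = ["fang", "jian"]
    phones = [u for syl in word for u in PHONE_UNITS[syl]]
    inifins = [u for syl in word for u in INITIAL_FINAL_UNITS[syl]]

    print("phone units:        ", phones)   # 7 short units; coarticulation split up
    print("Initial/Final units:", inifins)  # 4 units; each Final keeps its internal
                                            # glide/nucleus/coda coarticulation intact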

One problem accompanying the use of Initials and Finals is that the number of triphone-style units exceeds 100,000, making it impossible to collect sufficient data for each allophone separately. We solved this problem with a data-sharing method called "phonetic decision tree-based state tying," which allows a number of models to share the same data based on acoustic similarity. We investigated the acoustics of each unit and designed more than 300 phonetic questions intended to cluster their coarticulation effects. For compound Finals in particular, we asked different questions of the same unit depending on whether carryover or anticipation effects were being considered.
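The following simplified sketch conveys the idea behind decision tree-based state tying: context-dependent states are split by the phonetic question that gains the most likelihood, and states falling in the same leaf are tied so that they share data. It uses one-dimensional Gaussian statistics, two hypothetical questions, and four toy states; the actual training uses full acoustic statistics, the more than 300 questions mentioned above, and the HMnet topology.

    import math

    # Each context-dependent state: (left, center, right) context plus
    # occupancy count, sum of observations, and sum of squared observations.
    STATES = [
        (("b", "ang", "d"), 120, 240.0, 700.0),
        (("p", "ang", "t"), 100, 210.0, 650.0),
        (("l", "ang", "n"),  90, 450.0, 2400.0),
        (("n", "ang", "l"), 110, 530.0, 2800.0),
    ]

    QUESTIONS = {
        "L_is_plosive": lambda ctx: ctx[0] in {"b", "p", "d", "t", "g", "k"},
        "L_is_nasal_or_lateral": lambda ctx: ctx[0] in {"n", "l", "m"},
    }

    def pooled_loglik(states):
        """Log-likelihood of pooling states under one shared Gaussian."""
        n = sum(s[1] for s in states)
        if n == 0:
            return 0.0
        mean = sum(s[2] for s in states) / n
        var = max(sum(s[3] for s in states) / n - mean ** 2, 1e-4)
        return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

    def best_split(states):
        """Pick the question whose yes/no split gains the most likelihood."""
        base = pooled_loglik(states)
        best = None
        for name, ask in QUESTIONS.items():
            yes = [s for s in states if ask(s[0])]
            no = [s for s in states if not ask(s[0])]
            if yes and no:
                gain = pooled_loglik(yes) + pooled_loglik(no) - base
                if best is None or gain > best[0]:
                    best = (gain, name, yes, no)
        return best

    gain, question, yes, no = best_split(STATES)
    print(f"split on {question!r} (gain {gain:.1f})")
    print("tied leaf 1:", [s[0] for s in yes])   # these states share their data
    print("tied leaf 2:", [s[0] for s in no])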

At present, we partially circumvent the problem of homophone words by expanding the lexicon of the prototype system with new compound-word entries for frequent word concatenations. As a result, the degree of ambiguity for words of more than two syllables is only 1.08 words per base-syllable concatenation in our 10,000-word vocabulary. Language models such as n-grams are also effective in resolving the ambiguities of homophone words, since most homophones differ in their linguistic properties. In fact, we used a word bigram language model trained from a 300,000-word-token text corpus.
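As a rough illustration of how a bigram model discounts implausible homophone paths, the sketch below estimates word-bigram counts from a three-sentence toy corpus and scores two candidate word sequences for the same syllable string. The corpus and the add-one smoothing are stand-ins for illustration; they are not the 300,000-token corpus or the smoothing actually used in our system.

    from collections import Counter
    from math import log

    toy_corpus = [
        ["<s>", "我", "想", "订", "房间", "</s>"],
        ["<s>", "我", "想", "订", "单人间", "</s>"],
        ["<s>", "我", "要", "订", "房间", "</s>"],
    ]

    unigrams = Counter(w for sent in toy_corpus for w in sent)
    bigrams = Counter((a, b) for sent in toy_corpus for a, b in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    def bigram_logprob(prev, word):
        """Add-one smoothed log P(word | prev)."""
        return log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    def sentence_logprob(words):
        """Score one candidate path through the word hypothesis graph."""
        return sum(bigram_logprob(a, b) for a, b in zip(words, words[1:]))

    # The homophone path gets a much lower score than the intended sentence.
    print(sentence_logprob(["<s>", "我", "想", "订", "房间", "</s>"]))
    print(sentence_logprob(["<s>", "我", "想", "定", "方", "间", "</s>"]))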

Finally, we adapted the baseline acoustic models developed from the HKU96 and 863 speech databases to the task of hotel reservation via a maximum a posteriori (MAP) vector field smoothing (VFS) method using the entire CAS00 database as the adaptation data. The new set of acoustic models performed well for the ATR hotel reservation task.
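The sketch below shows the MAP mean update that such adaptation builds on, for a single Gaussian. The prior weight tau and the toy data are illustrative assumptions, and the vector field smoothing step, which propagates updates to Gaussians with little or no adaptation data, is omitted.

    import numpy as np

    def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
        """
        prior_mean : (D,) mean of a baseline (HKU96/863-trained) Gaussian
        frames     : (T, D) adaptation feature vectors (e.g. from CAS00)
        posteriors : (T,) occupation probabilities of this Gaussian per frame
        tau        : prior weight; larger tau keeps the mean closer to the baseline
        """
        occ = posteriors.sum()                 # soft count of assigned frames
        weighted_sum = posteriors @ frames     # (D,) first-order statistics
        return (tau * prior_mean + weighted_sum) / (tau + occ)

    # Toy usage: five two-dimensional adaptation frames for one Gaussian.
    rng = np.random.default_rng(0)
    prior = np.array([0.0, 0.0])
    frames = rng.normal(loc=1.0, scale=0.5, size=(5, 2))
    posteriors = np.array([0.9, 0.8, 1.0, 0.7, 0.95])

    print(map_adapt_mean(prior, frames, posteriors))  # pulled toward the new data,
                                                      # but anchored by the prior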


Further Work

We must address several issues in order to improve the present prototype system. The first is to integrate tone processing into the system, which should make it easier to increase the vocabulary and extend the task domain. The second is to accommodate speaker variability, especially dialect accents. China is a large country, and pronunciations can vary greatly from place to place; a system developed for people living in northern China would show significant degradation if used to recognize the speech of people living in southern China. Accordingly, appropriate modeling methods must be studied. The third issue is to implement more intensive language modeling. This should take the form of a hierarchical structure capable of flexibly handling phrase-order exchange, synonym substitution, homonym disambiguation, random digit combinations, proper name identification, and so forth. The fourth issue is to extend the system to spontaneous speech recognition. This requires not only spontaneous speech data collection but also techniques for handling the inherent problems of spontaneous speech, such as interjections.


Summary

Here, we briefly introduced language-specific problems in Chinese ASR and our approach to implementing an efficient prototype Chinese ASR system. We are now addressing several issues that we hope will lead to a more realistic and robust system that will enable ATR-MATRIX to provide real Chinese-to-Japanese interpretation in the future.


References