Toward Automated Chinese-to-Japanese Interpretation:
Enabling Recognition of Spoken Mandarin Chinese by ATR-MATRIX
Introduction
In March 2000, ATR SLT launched a project to integrate Chinese interpretation
into its MATRIX multilingual interpretation system. This initiative was undertaken
to address the potentially large market for products providing automatic interpretation
between Japanese and Chinese. This paper introduces the development of a front-end
prototype of a Chinese speech recognizer.
Chinese differs significantly from Japanese, English and similar languages in
its monosyllabic pronunciation system, its phonological use of pitch tones, and
its purely ideographic writing system. The application of modern ASR (automatic
speech recognition) techniques to Chinese requires special consideration of these
characteristics. After investigating sub-word selection, acoustic model development,
lexicon design and language modeling, we successfully ported SPREC, an HMnet-based
ATR recognizer, to a continuous Chinese recognizer with a 10,000 word vocabulary.
This system performed well in continuous speech recognition of ATR's hotel reservation
task involving Chinese read aloud from a prepared text.
Characteristics of Chinese Speech
Chinese is a monosyllabic tonal language in which each ideographic character is
uttered in the form of a monosyllable associated with a tone having a specific
pitch pattern. Each word comprises one or more ideographic characters. Figure
1 gives examples of four disyllabic words and their pronunciations represented
in Pinyin form. A tonal syllable is configured as a segmental base syllable and
a pitch tone. The base syllable is traditionally decomposed into an Initial (shengmu)
and a Final (yunmu). The Initial is either a consonant or a null, and the Final
may be a vowel or vowel compound. Mandarin speech contains 22 Initials and 38
Finals. A Final compound can be further decomposed into a maximum of three phone-like
units: glide, nucleus, and coda, as shown by the decomposition of the base syllable
liang in Figure 1. The figure clearly
shows that the different tones are the only feature discriminating the four words.
This characteristic poses problems for Chinese ASR: pitch production and perception
result in a much longer Final segment than an Initial one. Consequently, acoustic
matching scores from Initial segments account for a very small portion of those
for whole syllables. Syllables with the same Finals are then easily confused,
as the discrimination scores from their Initials may not be significant. Tone
recognition is also difficult to achieve with current ASR systems.
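
The decomposition described above can be illustrated with a minimal Python sketch. It assumes syllables are written in Pinyin with a trailing tone digit (for example "liang3"), which is a notation chosen here for illustration and not necessarily the one used in the ATR system. The function names and the simple spelling rules for glide and coda are hypothetical and cover only the regular cases.

```python
# Rough sketch: split a tonal Pinyin syllable into Initial, Final and tone,
# then split the Final into glide, nucleus and coda.
# The 21 consonant Initials below plus the null Initial give 22 Initials.
# Two-letter Initials are listed first so they are matched before z, c, s.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(tonal_syllable):
    """Split e.g. 'liang3' into (initial, final, tone)."""
    base, tone = tonal_syllable[:-1], int(tonal_syllable[-1])
    for initial in INITIALS:
        if base.startswith(initial):
            return initial, base[len(initial):], tone
    return "", base, tone  # null Initial

def split_final(final):
    """Split a Final into (glide, nucleus, coda) using simple spelling rules.

    Assumption: 'v' stands for the vowel written with an umlaut in Pinyin.
    """
    glide, coda, rest = "", "", final
    # A leading i/u/v followed by a, e or o is treated as a glide.
    if len(rest) > 1 and rest[0] in "iuv" and rest[1] in "aeo":
        glide, rest = rest[0], rest[1:]
    # A trailing nasal or off-glide is treated as the coda.
    for candidate in ("ng", "n", "i", "o", "u"):
        if len(rest) > len(candidate) and rest.endswith(candidate):
            rest, coda = rest[:-len(candidate)], candidate
            break
    return glide, rest, coda

initial, final, tone = split_syllable("liang3")
print(initial, final, tone, split_final(final))
```

For the base syllable liang, this yields the Initial l and the Final iang, which in turn splits into the glide i, the nucleus a and the coda ng.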
Another important characteristic of Chinese is the absence of clear word boundaries
in sentences.
A sentence is a sequence of ideographic characters without spaces or boundary
indications between words. This easily leads to ambiguity in machine processing,
as any character may form a word with either of the preceding or succeeding characters.
This characteristic, combined with the recognition ambiguities resulting from
monosyllabic tonal pronunciations, complicates Chinese ASR. As Figure
2 illustrates, even if little confusion exists regarding syllable candidates
derived from acoustic matching, the possible word hypothesis graph is already
rather large and complex even for a short seven-word phrase.
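
To make the growth of the hypothesis space concrete, the following Python sketch enumerates every way a syllable sequence can be segmented into lexicon words. The toy lexicon, the syllable sequence and the function name are assumptions made for illustration only; a real recognizer would build a compact graph and would not list the paths explicitly.

```python
# Toy lexicon keyed by base-syllable tuples (hypothetical entries).
TOY_LEXICON = {
    ("bei",), ("jing",), ("da",), ("xue",),
    ("bei", "jing"), ("da", "xue"),
    ("bei", "jing", "da", "xue"),
}

def segmentations(syllables, lexicon, max_len=4):
    """Return every segmentation of `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for k in range(1, min(max_len, len(syllables)) + 1):
        head = tuple(syllables[:k])
        if head in lexicon:
            for tail in segmentations(syllables[k:], lexicon, max_len):
                results.append([head] + tail)
    return results

for path in segmentations(["bei", "jing", "da", "xue"], TOY_LEXICON):
    print(" | ".join("-".join(word) for word in path))
```

Even with a single syllable candidate per position, four syllables already admit five word sequences in this toy lexicon; several acoustic candidates per position multiply that number further.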
ATR's Prototype Chinese ASR System
The development of ATR's Chinese ASR system began with the acquisition of Chinese
speech databases. Since the objective of the system is to play the role of a travel
agent, ATR solicited the assistance of the China Academia Sinica in obtaining
the CAS00 speech database, which includes 10 hours of hotel reservation dialogues
by 100 speakers. Additionally, we purchased two published speech corpora (HKU96
and 863) for the development of baseline acoustic models.
We used the ATRSPREC recognition engine as the development model for the system.
SPREC, originally developed for Japanese, is known for its HMnet (Hidden Markov
network), a highly compact acoustic model. It was subsequently ported successfully
for English speech recognition. The first version of the Chinese system does not
incorporate direct pitch tone recognition, as the system scheme was designed for
non-tonal languages. We therefore refer to this version as a "prototype Chinese
ASR system."
Phones are the sub-word units normally used in Japanese, English, and many other
ASR systems. However, we used 22 Initials and 38 Finals as the sub-word units
for our Chinese ASR system. We believe that this approach better models coarticulation
acoustics resulting from the monosyllabic characteristic. A phone in a compound
Final exhibits markedly different acoustics from a phone in isolation. Our speech
recognition experiments, which included a comparison with the alternative choice
in [2], have also shown that this approach achieves better recognition accuracy
than the use of phones.
One problem accompanying the use of Initials and Finals is that the number of
triphone-style units exceeds 100,000, making it impossible to collect sufficient
data for each allophone separately. Fortunately, we solved this problem with a
data-sharing method called "phonetic decision tree-based state tying." This approach
enables a number of models to share the same data based on acoustic similarity.
We investigated the acoustics of each unit and designed more than 300 phonetic
questions intended to cluster their coarticulation effects. For compound
Finals in particular, we asked different questions about the same unit depending
on whether carryover or anticipatory effects were being considered.
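
The core step of decision tree-based tying, choosing the phonetic question that best splits a pool of context-dependent units, can be sketched as follows. This is a simplified illustration under stated assumptions, not the SPREC implementation: each unit carries scalar toy observations in place of feature vectors, each pool is modeled by a single Gaussian, and the question names and context sets are hypothetical.

```python
import math

def gaussian_loglik(samples, var_floor=1e-3):
    """Log-likelihood of samples under their own maximum-likelihood Gaussian."""
    n = len(samples)
    if n == 0:
        return 0.0
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, var_floor)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_question(units, questions):
    """Pick the left-context question with the largest likelihood gain.

    units:     {(left, center, right): [scalar observations]}
    questions: {question name: set of left-context units answering "yes"}
    Returns (name, gain), or None if no question splits the pool.
    """
    parent = gaussian_loglik([x for obs in units.values() for x in obs])
    best = None
    for name, members in questions.items():
        yes = [x for (left, _, _), obs in units.items() if left in members for x in obs]
        no = [x for (left, _, _), obs in units.items() if left not in members for x in obs]
        if not yes or not no:
            continue  # the question does not divide the pool
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best

# Toy data: the Final "iang" after four different Initials (values invented).
toy_units = {
    ("l", "iang", "sil"): [1.0, 1.1],
    ("n", "iang", "sil"): [0.9, 1.2],
    ("j", "iang", "sil"): [3.0, 3.1],
    ("q", "iang", "sil"): [2.9, 3.2],
}
toy_questions = {"L_is_l_or_n": {"l", "n"}, "L_is_l_or_j": {"l", "j"}}
print(best_question(toy_units, toy_questions))
```

Units that end up in the same leaf after repeated splitting share one set of parameters, which is how rarely observed contexts borrow data from acoustically similar ones.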
At present, we partially circumvent the problem of homophone words by expanding
the lexicon of the prototype system with new compound word entries for frequent
word concatenations. As a result, the degree of ambiguity of words of more than
two syllables is only 1.08 words per base syllable concatenation in our 10,000
word vocabulary. Language models such as n-grams are also effective in resolving
the ambiguities of homophone words, since most homophones have different word properties.
In fact, we used a word bigram language model trained from a 300,000 word token
text corpus.
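
Both ideas in this paragraph can be sketched in a few lines of Python. The sketch reads "words per base syllable concatenation" as the number of lexicon words divided by the number of distinct base-syllable strings, and applies it to the whole toy lexicon; both choices are our interpretation. The toy lexicon, the counts, the function names and the add-one smoothing are assumptions for illustration and do not reproduce the bigram model actually trained.

```python
from collections import defaultdict

# Toy lexicon: word -> base syllables (tones removed). The first two entries
# differ only in tone, so they are homophones at the base-syllable level.
TOY_LEXICON = {
    "北京": ("bei", "jing"),
    "背景": ("bei", "jing"),
    "大学": ("da", "xue"),
}

def ambiguity(lexicon):
    """Average number of words per distinct base-syllable concatenation."""
    groups = defaultdict(list)
    for word, syllables in lexicon.items():
        groups[syllables].append(word)
    return len(lexicon) / len(groups)

def pick_homophone(prev_word, candidates, bigram_counts, unigram_counts, vocab_size):
    """Choose the candidate with the highest add-one smoothed bigram probability."""
    def prob(word):
        numerator = bigram_counts.get((prev_word, word), 0) + 1
        denominator = unigram_counts.get(prev_word, 0) + vocab_size
        return numerator / denominator
    return max(candidates, key=prob)

print(ambiguity(TOY_LEXICON))
# Invented counts: the preceding word is followed by the first homophone 5 times.
print(pick_homophone("去", ["背景", "北京"], {("去", "北京"): 5}, {"去": 6}, 4))
```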
Finally, we adapted the baseline acoustic models developed from the HKU96 and
863 speech databases to the hotel reservation task via a combined maximum a posteriori
(MAP) and vector field smoothing (VFS) method, using the entire CAS00 database as the
adaptation data. The new set of acoustic models performed well for the ATR hotel
reservation task.
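
As a rough illustration of the MAP part of this adaptation, the sketch below applies the standard MAP update to a single Gaussian mean: the adapted mean is a weighted average of the baseline (prior) mean and the adaptation data. The function name, the scalar features and the value of the prior weight tau are assumptions. The VFS step, which smooths and interpolates the resulting mean shifts across neighboring Gaussians, is not shown.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0, weights=None):
    """MAP estimate of a Gaussian mean.

    prior_mean: mean of the baseline model
    frames:     adaptation observations assigned to this Gaussian
    tau:        prior weight; larger values keep the result nearer the baseline
    weights:    optional occupation probabilities, one per frame (default 1.0)
    """
    if weights is None:
        weights = [1.0] * len(frames)
    occupancy = sum(weights)
    weighted_sum = sum(w * x for w, x in zip(weights, frames))
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)

# With no adaptation data the baseline mean is returned unchanged; as more
# data arrive, the estimate moves toward the mean of the adaptation frames.
print(map_adapt_mean(0.0, [2.0, 2.0, 2.0, 2.0], tau=4.0))
```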
Further Work
We must address several issues in order to improve the present prototype system.
The first is to integrate tone processing into the system, which should make it
easier to increase the vocabulary and extend the task domain. The second issue
is to accommodate speaker variability, especially dialect accents. China is a
large country and pronunciations can vary greatly from place to place. A system
developed for people living in northern China would show significant degradation
if used to recognize the speech of people living in southern China. Accordingly,
appropriate modeling methods must be studied. The third issue is to implement
intensive language modeling. This should take the form of a kind of hierarchical
structure with the capability to flexibly handle phrase order exchange, synonym
substitution, homonym disambiguation, random digit combinations, proper name
identification, and so forth. The fourth issue is to extend the system to spontaneous
speech recognition. This should include not only spontaneous speech data collection
but also techniques for handling inherent problems such as interjections in spontaneous
speech.
Summary
Here, we briefly introduced language-specific problems in Chinese ASR and our
approach to implementing an efficient prototype Chinese ASR system. We are now
addressing several issues that we hope will lead to a more realistic and robust
system that will enable ATR-MATRIX to provide real Chinese-to-Japanese interpretation
in the future.
Reference

