TR-SLT-0052: September 24, 2003

Shuwu Zhang, Weishan Li

Chinese Language Modeling based on BTEC Corpus: V1.0

Abstract: This technical report describes BTEC-based Chinese language modeling V1.0. The modeling procedure consists of two main steps. The first step is language preprocessing. In this step, a word segmentation and POS tagging tool designed for read-style text was applied to produce an initial word segmentation of the spoken-style BTEC text files. Two versions of manual word recombination were then applied to the initial output to correct segmentation errors, especially in proper nouns such as foreign person names, city names, and hotel names. The effect of manual word recombination on perplexity was also investigated. In the second step, a set of language modeling approaches, namely word N-gram, composite N-gram, multi-class N-gram, and multi-class composite N-gram, was compared in terms of both perplexity and recognition accuracy. Experimental results showed that the multi-class composite N-gram with an optimal configuration is an effective language modeling approach for Chinese, as has already been demonstrated for Japanese and English.
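Perplexity, one of the two measures used in the comparison, can be illustrated with a minimal sketch. The code below is not the toolkit or the model configuration used in this report: it assumes a plain word bigram model with add-one smoothing, and the function names `train_bigram` and `perplexity` as well as the `<unk>` token for unseen words are hypothetical choices made only for this illustration. Input sentences are assumed to be already segmented into words.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from pre-segmented sentences.

    Illustrative only: a plain word bigram, not the report's models.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    # Predictable tokens: end marker, unknown-word token, and training words.
    vocab = {"</s>", "<unk>"}
    for words in sentences:
        vocab.update(words)
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab


def perplexity(sentences, model):
    """Per-token perplexity under add-one smoothing (an assumed choice)."""
    history_counts, bigram_counts, vocab = model
    vocab_size = len(vocab)
    log_prob = 0.0
    token_count = 0
    for words in sentences:
        # Map words never seen in training to the unknown-word token.
        mapped = [w if w in vocab else "<unk>" for w in words]
        tokens = ["<s>"] + mapped + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
            log_prob += math.log(p)
            token_count += 1
    return math.exp(-log_prob / token_count)


if __name__ == "__main__":
    # Toy segmented sentences, invented for this example.
    train = [["a", "b"], ["a", "c"]]
    model = train_bigram(train)
    print(perplexity(train, model))
    print(perplexity([["c", "b"]], model))
```

Because perplexity is normalized by the number of tokens, changing the segmentation (for example, by recombining words into longer units) changes both the vocabulary and the token count, so perplexities computed before and after recombination are not measured over the same units.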