Shuwu Zhang, Weishan Li
Chinese Language Modeling
based on BTEC Corpus: V1.0
Abstract: This technical report describes the details of BTEC-based Chinese language
modeling V1.0. The modeling procedure consists mainly of two steps. The first
step is language preprocessing. In this step, a word segmentation and POS tagging
tool built for read-style text was applied for the initial word segmentation of the
spoken-style BTEC text files. Then, two versions of manual word recombination were
applied on top of the initial processing to correct segmentation errors,
especially for proper nouns such as foreign person names, city names, and hotel names.
The changes in perplexity caused by manual word recombination were also investigated.
In the second step, a set of language modeling approaches, including word N-gram,
composite N-gram, multi-class N-gram, and multi-class composite N-gram, were
compared in terms of both perplexity and recognition accuracy.
Experimental results showed that multi-class composite N-gram with an optimal
configuration is a good language modeling approach for Chinese, as has already
been demonstrated for Japanese and English.
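
For readers unfamiliar with the perplexity measure mentioned above, the following is a minimal sketch of how perplexity can be computed for a word bigram model over segmented text. It is not the toolkit or configuration used in this report: the add-one smoothing, the function names `train_bigram` and `perplexity`, and the toy sentences are all assumptions chosen for illustration, and the sketch assumes every test word also appears in the training data. Because the measure is normalized by the number of tokens, a different segmentation of the same text changes both the token count and the resulting value.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect history and bigram counts from pre-segmented sentences."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            history[prev] += 1          # how often prev is used as a history
            bigrams[(prev, cur)] += 1
    return history, bigrams, vocab

def perplexity(sentences, history, bigrams, vocab):
    """Per-token perplexity with add-one smoothing (an illustrative choice)."""
    v = len(vocab)
    log_prob, n = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (history[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Toy, hypothetical travel-domain sentences, already segmented into words.
train = [["我", "想", "订", "房间"], ["我", "想", "去", "北京"]]
model = train_bigram(train)

pp_seen = perplexity([["我", "想", "订", "房间"]], *model)
pp_shuffled = perplexity([["房间", "订", "想", "我"]], *model)
print(pp_seen, pp_shuffled)
```

A word sequence seen in training receives a lower perplexity than the same words in an unseen order, which is the sense in which lower perplexity indicates a better fit of the model to the text.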