TR-SLT-0046: August 27, 2003

Fadi Badra, Hirofumi Yamamoto

Comparative Study on Multi-Class Composite N-grams Applied to English and Japanese

Abstract: This document presents the research work I conducted at ATR over five months, from April to August 2003. My work mainly consisted of applying Multi-Class Composite N-gram language models, recently proposed by ATR researchers H. Yamamoto, S. Isogai and Y. Sagisaka (Yamamoto, 2001) and so far applied only to Japanese, to the English language. The purpose of the experiments was to provide experimental data on this model for both languages and to determine to what extent this new technique, which showed good results for Japanese, can also be applied to English. The tests were performed on training corpora of different sizes extracted from the B.T.E.C. Running the experiments on the two languages under the same conditions enabled us to compare them. The results showed that Multi-Class language models improve on conventional Class-Based ones for English too, but with different optimal connectivity information and only for small training corpora.