TR-IT-0096 : 1994.12

Arian Halber

Capturing Long Distance Dependencies from Parsed Corpora

Abstract

Purpose

We aim to improve Speech Recognition through specific language modeling: we want to reveal Long Distance Dependencies (LDDs), which are obvious to humans but completely ignored by bigram or trigram models. For instance, in "the flight that leaves at noon serves dinner", the link between "flight" and "serves" lies beyond any trigram window. To track LDDs automatically, yet keep them efficient and consistent, we propose to use pre-parsed data.

Study

The study is based on the Penn Treebank, a corpus of syntactically parsed data. We define two LDD rules, Brother and Parent, and extract them, along with Bigrams, from a training set. We study the ATIS corpus in particular: despite its small size, it is well targeted to the task. We estimate and compare the Perplexities of a (Bigrams + LDD) model and a (Bigrams only) model; this quantifies how much LDDs ease the recognition task. We obtain roughly an 8% improvement on the testing set.
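
The abstract does not spell out the Brother and Parent rules, so the following is only a plausible reading: a Brother pair links the head words of two sibling constituents, and a Parent pair links a constituent's head word to the head of each child phrase. The sketch below extracts such pairs from a Penn Treebank-style bracketed parse; the head approximation (leftmost word), the function names, and the pairing scheme are all assumptions, not the report's actual definitions.

    import re
    from itertools import combinations

    def parse(s):
        """Turn a bracketed parse like (S (NP ...) (VP ...)) into nested
        (label, children) tuples; leaves are plain word strings."""
        tokens = re.findall(r"\(|\)|[^\s()]+", s)
        pos = 0
        def node():
            nonlocal pos
            pos += 1                       # consume "("
            label = tokens[pos]; pos += 1  # constituent label
            children = []
            while tokens[pos] != ")":
                if tokens[pos] == "(":
                    children.append(node())
                else:
                    children.append(tokens[pos]); pos += 1
            pos += 1                       # consume ")"
            return (label, children)
        return node()

    def head(t):
        """Crude head approximation (an assumption): the leftmost word."""
        while isinstance(t, tuple):
            t = t[1][0]
        return t

    def extract(t, brothers, parents):
        """Collect assumed Brother and Parent word pairs from a tree."""
        if not isinstance(t, tuple):
            return
        phrases = [c for c in t[1] if isinstance(c, tuple)]
        for a, b in combinations(phrases, 2):   # heads of sibling phrases
            brothers.add((head(a), head(b)))
        for c in phrases:                       # parent head -> child head
            if head(c) != head(t):
                parents.add((head(t), head(c)))
        for c in phrases:
            extract(c, brothers, parents)

    brothers, parents = set(), set()
    tree = parse("(S (NP (DT the) (NN flight)) (VP (VBZ serves) (NP (NN dinner))))")
    extract(tree, brothers, parents)
    # brothers now contains e.g. ("the", "serves"); parents e.g. ("serves", "dinner")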

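How Perplexity quantifies the gain can be made concrete. Below is a minimal sketch, assuming the Bigram and LDD scores are combined by linear interpolation and that LDD evidence may be triggered by any earlier word in the sentence; the abstract does not state the actual combination scheme, so the interpolation weight and the trigger rule are assumptions.

    import math

    def perplexity(test_sents, prob):
        """Perplexity = exp of the average negative log-probability per word."""
        logp, n = 0.0, 0
        for sent in test_sents:
            history = ["<s>"]
            for w in sent:
                logp += math.log(prob(w, history))
                history.append(w)
                n += 1
        return math.exp(-logp / n)

    def bigram_only(bigram, floor=1e-6):
        """P(w | previous word), with a small floor for unseen pairs."""
        return lambda w, history: bigram.get((history[-1], w), floor)

    def bigram_plus_ldd(bigram, ldd, lam=0.3, floor=1e-6):
        """Assumed combination: interpolate the bigram probability with LDD
        evidence from any word before the immediately preceding one.
        Note: the interpolation is not renormalized; illustrative only."""
        def prob(w, history):
            p_bi = bigram.get((history[-1], w), floor)
            p_ldd = max((ldd.get((h, w), 0.0) for h in history[:-1]), default=0.0)
            return (1 - lam) * p_bi + lam * p_ldd if p_ldd > 0 else p_bi
        return prob

Comparing perplexity(test, bigram_only(B)) with perplexity(test, bigram_plus_ldd(B, L)) on the same testing set gives a relative improvement of the kind reported; the roughly 8% figure would correspond to the (Bigrams + LDD) perplexity being about 8% lower.
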
Conclusion

Brothers have little influence on Perplexity: though consistent, as shown by their Weight, they are still too scarce. Parents are more common, and their consistency captures information, thus improving Speech Recognition.