Arian Halber
Capturing Long Distance Dependencies from Parsed
Corpora
Abstract

Purpose
We aim to improve Speech Recognition through specific language modeling;
we want to reveal Long Distance Dependencies (LDDs), which are obvious to humans but
completely ignored by bi- and tri-gram models.
To track LDDs automatically, while keeping them fast to extract and consistent, we
propose to use pre-parsed data.
Study
We propose to
- Extract the dependencies automatically
- Use them as statistical predictors
- Evaluate their effectiveness for the recognition task
The study is based on the Penn Treebank, a corpus of syntactically parsed data. We
define two LDD rules, Brother and Parent, and extract them, along with
Bigrams, from a Training set. We study the ATIS corpus in particular: despite its
small size, it offers a well-targeted domain.
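As a rough illustration, the sketch below collects word-pair dependencies from a
PTB-style bracketed parse. The Brother and Parent rules are not spelled out in this
abstract, so the pairings used here (leftmost words of adjacent sibling constituents
for Brother, a constituent's leftmost word against each child's for Parent) are
placeholder assumptions, not the rules defined in the study.

    # Illustrative sketch only: collect word-pair dependencies from a
    # PTB-style bracketed parse.  The "Brother" and "Parent" pairings below
    # are placeholder assumptions, not the rules defined in the study.
    from collections import Counter

    def parse(tokens):
        """Recursively build (label, children) nodes from bracket tokens."""
        assert tokens.pop(0) == "("
        label = tokens.pop(0)
        children = []
        while tokens[0] != ")":
            children.append(parse(tokens) if tokens[0] == "(" else tokens.pop(0))
        tokens.pop(0)  # consume ")"
        return (label, children)

    def first_word(node):
        """Leftmost terminal word under a node."""
        return node if isinstance(node, str) else first_word(node[1][0])

    def collect(node, brothers, parents):
        if isinstance(node, str):
            return
        _, children = node
        head = first_word(node)
        for child in children:
            if first_word(child) != head:          # "Parent" pair (assumed form)
                parents[(head, first_word(child))] += 1
        for left, right in zip(children, children[1:]):
            brothers[(first_word(left), first_word(right))] += 1  # "Brother" pair
        for child in children:
            collect(child, brothers, parents)

    brothers, parents = Counter(), Counter()
    tree = "(S (NP (DT the) (NN flight)) (VP (VBZ leaves) (NP (NN Denver))))"
    collect(parse(tree.replace("(", " ( ").replace(")", " ) ").split()),
            brothers, parents)
    print(brothers)
    print(parents)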
We estimate and compare the Perplexities of the (Bigrams + LDD) and (Bigrams only)
models; this quantifies how much LDDs ease the recognition task. We obtain
roughly an 8% improvement on the Testing set.
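For concreteness, here is a minimal sketch of such a perplexity comparison, assuming
(purely for illustration) that the LDD predictors are linearly interpolated with the
bigram probabilities; the abstract does not specify the actual combination scheme,
and the tiny probability tables below are made up.

    # Minimal sketch of the perplexity comparison under an assumed linear
    # interpolation of bigrams with a hypothetical Parent LDD table.
    import math

    def perplexity(sentences, prob):
        """prob(history, word) -> P(word | history); perplexity = exp(mean -log P)."""
        log_sum, n = 0.0, 0
        for sent in sentences:
            history = ["<s>"]
            for w in sent:
                log_sum += math.log(prob(history, w))
                history.append(w)
                n += 1
        return math.exp(-log_sum / n)

    p_floor = 1e-3                       # back-off floor, avoids log(0)
    p_bigram = {("<s>", "show"): 0.4, ("show", "me"): 0.5, ("me", "flights"): 0.2,
                ("flights", "to"): 0.3, ("to", "denver"): 0.1}
    p_parent = {("show", "flights"): 0.6, ("flights", "denver"): 0.5}  # toy Parent LDD table

    def bigram_only(history, w):
        return p_bigram.get((history[-1], w), p_floor)

    def bigram_plus_ldd(history, w, lam=0.7):
        # long-distance predictor: best Parent match anywhere in the history
        p_ldd = max((p_parent.get((h, w), 0.0) for h in history), default=0.0)
        return lam * bigram_only(history, w) + (1 - lam) * max(p_ldd, p_floor)

    test = [["show", "me", "flights", "to", "denver"]]
    pp_bi, pp_ldd = perplexity(test, bigram_only), perplexity(test, bigram_plus_ldd)
    print(pp_bi, pp_ldd, (pp_bi - pp_ldd) / pp_bi)  # relative perplexity reduction

The mixture above is not renormalised over the vocabulary; it only serves to show how
a lower perplexity for the (Bigrams + LDD) model translates into a relative reduction
of the kind reported here.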
Conclusion
Brothers have little influence on Perplexity; though consistent, as shown by their
Weight, they are still too scarce. Parents are more common, and their consistency
captures Information, thus improving Speech Recognition.