David Carter
English Text Representation
and Part-of-Speech Tagging Issues
Abstract: This report describes an investigation of the English travel (BTEC) corpus and the part-of-speech (POS) tagging scheme currently in use at ATR. I first examine the role of
POS tagging in a speech understanding system for English, where POS tag ambiguity is
relatively high, and suggest that tagging can be very useful for translation but less
so as an enhancement to the speech recognizer's language model. Next, attention is
given to issues of word representation, principally the question of when a multi-word phrase
should be treated as a single lexical unit for the purpose of tagging. I then compare the
ATR and University of Pennsylvania (Penn) tagging schemes on the basis of a number
of criteria, and propose revisions to the ATR scheme to make it more suited to its
purpose. Finally, I look at ways in which tagged texts from ATR and elsewhere can
be used to speed up the process of manual annotation and correction. The intention is
that all the elements of this report can be used as the basis for writing software to
enhance the accuracy and usefulness of the POS tags used for BTEC and any future
English-language resources developed at ATR.