TR-SLT-0043 : 2003.06.27

David Carter

English Text Representation and Part-of-Speech Tagging Issues

Abstract: This report describes an investigation of the English travel (BTEC) corpus and the part-of-speech (POS) tagging scheme currently in use at ATR. I first examine the role of POS tagging in a speech understanding system for English, where POS tag ambiguity is relatively high, and suggest that tagging can be very useful for translation but less so as an enhancement to the speech recognizer's language model. Next, I turn to issues of word representation, principally the question of when a multi-word phrase should be treated as a single lexical unit for the purpose of tagging. I then compare the ATR and University of Pennsylvania (Penn) tagging schemes against a number of criteria, and propose revisions to the ATR scheme to make it better suited to its purpose. Finally, I look at ways in which tagged texts from ATR and elsewhere can be used to speed up the process of manual annotation and correction. The intention is that every element of this report can serve as a basis for writing software to improve the accuracy and usefulness of the POS tags used for BTEC and for any future English-language resources developed at ATR.