TR-IT-0145

TR-IT-0145 :1995.12

Akira Ushioda

Hierarchical Clustering of Words

Abstract: This paper describes a data-driven hierarchical word-clustering method in which a large vocabulary of English words (70,000 words) is clustered bottom-up with respect to corpora ranging in size from 5 million to 50 million words, using a greedy algorithm that tries to minimize the average loss of mutual information of adjacent classes. The resulting hierarchical clusters of words are then naturally transformed into a bit-string representation (i.e., word bits) for every word in the vocabulary. The word bits and word clusters are evaluated using two measures: (a) the error rate of the ATR Decision-Tree Part-Of-Speech Tagger and (b) the perplexity of class-based trigram models on the UPenn Wall Street Journal corpus and the ATR corpus. The portability of word bits from one domain to another is also discussed.
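
To make the idea concrete, the following is a minimal sketch of greedy bottom-up clustering driven by the mutual information of adjacent classes, followed by reading bit strings off the merge history. It is an illustration under stated assumptions, not the procedure used in the report: the function names (`average_mutual_information`, `greedy_cluster`) and the toy bigram counts are hypothetical, every pair of classes is tried at each step (far too slow for a 70,000-word vocabulary), and the resulting bit strings vary in length because the merge tree is not balanced.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, cls):
    """Average mutual information between adjacent classes.

    `bigrams` maps (w1, w2) to a count; `cls` maps each word to a class label.
    """
    pair, left, right = Counter(), Counter(), Counter()
    total = 0
    for (w1, w2), n in bigrams.items():
        c1, c2 = cls[w1], cls[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
        total += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def greedy_cluster(bigrams):
    """Merge classes bottom-up; return a bit string for every word.

    Assumption: at each step the merge that keeps the average mutual
    information highest (i.e. loses the least) is chosen by brute force.
    """
    words = sorted({w for pair in bigrams for w in pair})
    cls = {w: w for w in words}        # each word starts in its own class
    members = {w: [w] for w in words}  # class label -> words in that class
    bits = {w: "" for w in words}
    while len(members) > 1:
        best = None
        for a, b in combinations(sorted(members), 2):
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        # The two merged subtrees become the 0-branch and the 1-branch.
        for w in members[a]:
            bits[w] = "0" + bits[w]
        for w in members[b]:
            bits[w] = "1" + bits[w]
            cls[w] = a
        members[a] += members.pop(b)
    return bits


# Hypothetical toy counts, for illustration only.
toy_bigrams = {
    ("the", "dog"): 4, ("the", "cat"): 4,
    ("a", "dog"): 3, ("a", "cat"): 3,
    ("dog", "runs"): 2, ("cat", "runs"): 2,
    ("dog", "sleeps"): 2, ("cat", "sleeps"): 2,
}

if __name__ == "__main__":
    for word, code in sorted(greedy_cluster(toy_bigrams).items()):
        print(f"{word:8s} {code}")
```

Because the last merge supplies the leading bit, words that were merged early share long prefixes, so a prefix of a word's bit string identifies a coarser cluster containing it.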