Akira Ushioda
Hierarchical Clustering of Words
Abstract: This paper describes a data-driven hierarchical word-clustering method in which a
large vocabulary of English words (70,000 words) is clustered bottom-up, with respect to
corpora ranging in size from 5 million to 50 million words, using a greedy algorithm that
tries to minimize the loss of average mutual information between adjacent classes. The resulting
hierarchical clusters of words are then naturally transformed into a bit-string representation
(i.e., word bits) for all the words in the vocabulary. Evaluation of the word bits and
word clusters constructed is carried out via two measures: (a) the error rate of the
ATR Decision-Tree Part-Of-Speech Tagger and (b) the perplexity measure of class-based
trigram models on the UPenn Wall Street Journal corpus and the ATR corpus. Portability
of word bits from one domain to another is also discussed.
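
The two ideas summarized above, greedy bottom-up merging guided by mutual information and reading word bits off the resulting tree, can be illustrated with a minimal Python sketch. This is not the paper's procedure: it is a naive version that recomputes the average mutual information (AMI) for every candidate merge, so it is only usable on toy data and would not scale to a 70,000-word vocabulary. The function names (average_mutual_information, cluster, word_bits), the toy sentence, and the choice of "0" for the left branch and "1" for the right branch are all assumptions made for illustration; the bit strings it produces also vary in length, which may differ from the representation used in the paper.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assign):
    """AMI (in bits) between adjacent classes under the word -> class map."""
    joint = Counter()
    for (w1, w2), n in bigrams.items():
        joint[(assign[w1], assign[w2])] += n
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in joint.items():
        left[c1] += n   # count of c1 as the first class of a pair
        right[c2] += n  # count of c2 as the second class of a pair
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )


def cluster(tokens):
    """Greedy bottom-up clustering; returns a binary tree of nested tuples.

    Assumption: every word starts in its own class, and at each step the
    pair whose merge keeps AMI highest (i.e. loses the least) is merged.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    assign = {w: w for w in set(tokens)}      # word -> class id
    trees = {w: w for w in assign}            # class id -> subtree
    while len(trees) > 1:
        best = None
        for a, b in combinations(sorted(trees), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        assign = {w: (a if c == b else c) for w, c in assign.items()}
        trees[a] = (trees[a], trees.pop(b))   # record the merge in the tree
    return next(iter(trees.values()))


def word_bits(tree, prefix=""):
    """Map each word to the path from the root: '0' = left, '1' = right."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    bits = word_bits(left, prefix + "0")
    bits.update(word_bits(right, prefix + "1"))
    return bits


# Hypothetical toy corpus, for illustration only.
tokens = "the cat sat on the mat the dog sat on the rug".split()
bits = word_bits(cluster(tokens))
for word in sorted(bits, key=bits.get):
    print(bits[word], word)
```

In this sketch, words that are merged early share a long common prefix in their bit strings, so truncating the strings to a fixed number of leading bits gives a coarser clustering of the same vocabulary.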