Etienne DENOUAL
A method to quantify corpus similarity
and its application to quantifying the degree of
literality in a document
Abstract: Comparing and quantifying corpora is a key issue in corpus-based translation and
corpus linguistics, for which there is still a notable lack of measures. This makes it
difficult for a user to isolate, transpose, or extend the interesting features of a corpus to
other NLP systems. In this work we address the issue of measuring similarity
between corpora. We propose a scale between two user-chosen corpora on which any
given third corpus can be assigned a coefficient of similarity, based on the cross-entropy
of statistical N-gram character models. A possible application of this framework is to
quantify similarity in terms of literality (or conversely, orality). To this end we carry out
experiments on several well-known corpora in both English and Japanese,
and show that the defined similarity coefficient is robust to variations in language and
model order. Within this framework we further investigate the notion of
homogeneity in the case of a large multilingual resource.
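
As an illustration of the idea summarised above, the following minimal sketch places a third text on a scale between two reference corpora using the cross-entropy of character N-gram models. It is not the implementation used in this work: the add-one smoothing, the function names, and in particular the normalisation h0 / (h0 + h1) are assumptions chosen only to produce a value between 0 and 1, and the toy strings stand in for real corpora.

```python
import math
from collections import Counter


def train_char_ngram(text, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return {"n": n, "ngrams": ngrams, "contexts": contexts,
            "vocab_size": len(set(text)) + 1}  # +1 slot for unseen characters


def cross_entropy(model, text):
    """Average bits per character of `text` under `model`.

    Assumption: add-one (Laplace) smoothing, chosen for brevity.
    """
    if not text:
        raise ValueError("text must be non-empty")
    n = model["n"]
    padded = " " * (n - 1) + text
    total_bits = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        prob = (model["ngrams"][gram] + 1) / (
            model["contexts"][gram[:-1]] + model["vocab_size"])
        total_bits -= math.log2(prob)
    return total_bits / len(text)


def similarity_coefficient(ref0, ref1, text, n=3):
    """Place `text` on a 0..1 scale between two reference corpora.

    Assumed normalisation: values below 0.5 mean `text` is closer to
    `ref0`, values above 0.5 mean it is closer to `ref1`.
    """
    h0 = cross_entropy(train_char_ngram(ref0, n), text)
    h1 = cross_entropy(train_char_ngram(ref1, n), text)
    return h0 / (h0 + h1)


# Toy stand-ins for a "literal" and an "oral" reference corpus.
written = "the committee approved the proposal after review. " * 20
spoken = "uh well you know i mean yeah kind of like okay " * 20

print(similarity_coefficient(written, spoken, "the proposal was approved."))
print(similarity_coefficient(written, spoken, "yeah i mean you know okay"))
```

With real data, the two references would be large corpora chosen by the user, and the model order `n` could be varied to check how stable the coefficient remains.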