カルロス トロンコッソ アラルコン, 山本博史, 菊井玄一郎
Adaptation of Trigger-Based Language Models Using Heterogeneous Corpora
Abstract: We present a novel approach to trigger-based language model adaptation for large
vocabulary continuous speech recognition (LVCSR) that uses two different corpora to
construct the set of trigger pairs. In language modeling for LVCSR, a very large training
data set is usually too general, so the task dependency is lost. On
the other hand, task-dependent training data are usually
insufficient, so the probability estimates are unreliable. The proposed approach tries to
overcome this generality-sparseness trade-off by first extracting task-dependent
trigger pairs from a Japanese conversational text corpus, which represents the target task, and
then avoiding data sparseness by calculating the likelihoods of the pairs from a huge
text corpus. Word recognition accuracy improved slightly when
the two corpora were used together, whereas it degraded when either
the conversational text corpus alone or the huge corpus alone was used both to extract the pairs and
to calculate their likelihoods.
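
The two-corpus idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the pair selection criterion or the likelihood estimator, so the sketch assumes sentence-level co-occurrence, a simple count threshold (the hypothetical parameter `min_count`) for selecting pairs from the task corpus, and a relative-frequency estimate of P(b | a) from the large corpus. All function names and the toy data are illustrative.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(sentences):
    """Count in how many sentences each word and each ordered word pair occurs."""
    word_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        vocab = set(sentence)  # count each word at most once per sentence
        word_counts.update(vocab)
        pair_counts.update(permutations(sorted(vocab), 2))
    return word_counts, pair_counts


def extract_trigger_pairs(task_sentences, min_count=2):
    """Step 1 (assumed criterion): keep ordered pairs that co-occur in at
    least `min_count` sentences of the small task-dependent corpus."""
    _, pair_counts = cooccurrence_counts(task_sentences)
    return {pair for pair, count in pair_counts.items() if count >= min_count}


def estimate_pair_probabilities(pairs, large_sentences):
    """Step 2 (assumed estimator): relative frequency of b among the
    sentences of the large corpus that contain a, for the selected pairs only."""
    word_counts, pair_counts = cooccurrence_counts(large_sentences)
    return {
        (a, b): pair_counts[(a, b)] / word_counts[a]
        for (a, b) in pairs
        if word_counts[a] > 0  # skip triggers unseen in the large corpus
    }


# Toy data standing in for the conversational corpus and the huge corpus.
task_corpus = [
    ["hotel", "reservation", "please"],
    ["hotel", "reservation", "tonight"],
    ["taxi", "please"],
]
large_corpus = [
    ["hotel", "reservation"],
    ["hotel", "lobby"],
    ["hotel", "reservation", "desk"],
    ["hotel", "bar"],
    ["reservation", "number"],
]

pairs = extract_trigger_pairs(task_corpus)
probabilities = estimate_pair_probabilities(pairs, large_corpus)
for pair in sorted(probabilities):
    print(pair, round(probabilities[pair], 3))
```

In this sketch the task corpus decides only which pairs exist, while every probability comes from the large corpus, so a pair that is rare in the task data can still receive an estimate backed by more observations.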