


Spoken Language Translation Research Laboratories
Speech Translation Technologies for
Real-world Applications
YAMAMOTO Seiichi, Director
1. Introduction
For many years, people throughout the world have shared the dream of automatic translation technology that would let them communicate with speakers of other languages without having to learn those languages. Research toward this dream began in the 1950s and focused mainly on the translation of written sentences. The dramatic increase in direct person-to-person communication triggered by international trade and human migration since the 1970s then motivated basic research on spoken language translation. This research has brought outstanding progress in the component technologies of speech translation systems: speech recognition, language translation, and speech synthesis. Speech translation technology as a whole has reached a stage at which direct sentence-by-sentence translation can be achieved for dialog in limited domains and well-controlled environments. The research focus should now shift to the specific technical issues that stand in the way of real-world applications.
We believe that real-world applications of speech translation technologies can be realized by achieving the following three goals. First, we aim to make speech recognition robust. Speech recognition performance deteriorates in real-world applications because of ambient noise and variation in users' speaking styles. Second, we will increase the coverage of expressions that can be translated. Current language translation relies on rules that are developed for individual language pairs in particular domains and that depend on human insight. Applying this technology to other domains or other language pairs therefore requires a great deal of further development. Third, we will create confidence measures for translated results. Conventional language translation technologies lack indices that quantify the reliability of translation results, so users cannot rely on them without anxiety about the accuracy of the translation.
It has been said that we can expect the interactions of people and the exchange of products and information across national borders to further expand in the 21st century, and that one of the greatest obstacles to this expansion is the language barrier. For this reason, there is growing hope for multi-language speech translation systems that will enable communication among people who speak different languages. We aim to realize these hopes in our current research project in the field of speech translation, funded by TAO and begun in January of this year. The goal of our R&D efforts is to bring the speech translation technology available in limited conditions into real-world applications.
Goals for long-term research and for the near future
In order to enable smooth verbal communication between people of different languages, it is necessary to translate their utterances based on an understanding of the intentions of the speakers, their cultural backgrounds, and the context of the dialog. Our ultimate goal is to develop a speech translation system to deal with all of these features.
Such a speech translation system cannot be realized with current technologies, and long-term basic research is still required to develop the necessary technology. Many common expressions, however, can be handled successfully by direct sentence-to-sentence translation.
Our goal for the near future, as the first stage of long-term research, is to establish the technology to translate spoken language uttered in various real-world environments by using only literal information for each individual sentence. To achieve this goal, we must research and develop speech recognition technologies that are highly robust against variations in ambient noise and speaking styles of users, as well as language translation technologies that can accommodate diverse expressions.
Accordingly, we have set the following targets as the necessary component technologies:
Speech recognition technologies
Current speech recognition technology still performs insufficiently in real environments, especially when speech translation is the target.
The keyword "real environment" stands for a set of challenges: variation in speaking style, environmental noise, and reverberation. Our research aims to make speech recognition robust against these problems.
The changes occurring in real environments cannot be fully modeled with explicit rule-based approaches. The most promising approach, which has already achieved a certain amount of success, uses statistical models, which reflect the structures of the changes implicitly. Therefore, we will continue our research by collecting large corpora from real environments from which we can create these statistical models.
Moreover, speech recognition currently uses only local language constraints and virtually ignores semantic information. This can lead to recognition results that are locally correct but make no sense as a whole. We address this problem by closely combining speech recognition and language processing, for example, by analyzing recognition results and arranging the meaningful parts, or by using confidence measures to extract only the reliable parts.
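The idea of using confidence measures to extract only the reliable parts can be illustrated with a minimal sketch. It assumes that the recognizer outputs a per-word confidence score between 0 and 1; the function name `reliable_segments`, the threshold value, and the sample hypothesis are hypothetical and serve only as illustration, not as a description of our system.

```python
from typing import List, Tuple


def reliable_segments(
    hypothesis: List[Tuple[str, float]], threshold: float = 0.7
) -> List[List[str]]:
    """Split a recognition hypothesis into runs of consecutive words
    whose confidence is at least `threshold` (an assumed cut-off)."""
    segments: List[List[str]] = []
    current: List[str] = []
    for word, confidence in hypothesis:
        if confidence >= threshold:
            current.append(word)
        elif current:
            # A low-confidence word closes the current reliable run.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy hypothesis: the third word was probably misrecognized.
hyp = [("i", 0.9), ("would", 0.95), ("bike", 0.3), ("a", 0.8), ("room", 0.9)]
print(reliable_segments(hyp))
```

In such a scheme, only the reliable runs would be passed on to language processing, and the low-confidence words would be dropped or flagged for confirmation by the user.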
Corpus-based language translation technologies
Most of the currently available commercial machine translation systems are rule-based translation systems, in which rules play a central role, mainly because it is difficult to gather data that exhaustively cover diverse language phenomena. In rule-based systems, efforts have been made to improve rules that abstract the language phenomena by using human insight. In taking this type of approach, however, it is difficult to port a particular system to other domains, or to upgrade the system to accommodate new expressions.
With the increased availability of substantial bilingual corpora by the 1980s, corpus-based machine translation (MT) technologies such as example-based MT and stochastic MT were proposed to cope with the limitations of the rule-based systems that had formerly been the dominant paradigm. Since that time, we have conducted research on applying corpus-based methods to speech translation and have developed several technologies.
Our research experience shows us that corpus-based approaches are suitable for speech translation technology. This is because corpus-based methods: (1) can be applied to different domains; (2) are easy to adapt to multiple languages; and (3) can handle ungrammatical sentences, which are common in spoken language.
One of our research themes is to develop example-based translation technologies that can be applied across a wide range of domains, and to develop stochastic translation technologies that can be applied to language pairs with completely different structures, such as English and Japanese. Example-based methods and stochastic methods each have different advantages and disadvantages, and so we plan to combine them into a single more powerful system.
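The retrieval step at the heart of example-based translation can be sketched as follows. This is a toy illustration under stated assumptions: the similarity measure (token overlap ratio from Python's standard `difflib`), the function name `nearest_example`, and the two example pairs are all hypothetical. The sketch only finds the closest stored example; a real system would go on to adapt the retrieved translation to the input.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def nearest_example(
    source: str, examples: List[Tuple[str, str]]
) -> Tuple[Tuple[str, str], float]:
    """Return the stored (source, target) pair most similar to `source`,
    together with its similarity score. Assumes `examples` is non-empty."""
    src_tokens = source.lower().split()

    def similarity(pair: Tuple[str, str]) -> float:
        # Ratio of matching tokens between the input and the example source.
        return SequenceMatcher(None, src_tokens, pair[0].lower().split()).ratio()

    best = max(examples, key=similarity)
    return best, similarity(best)


# Toy bilingual examples (English source, romanized Japanese target).
examples = [
    ("where is the station", "eki wa doko desu ka"),
    ("how much is this", "kore wa ikura desu ka"),
]
pair, score = nearest_example("where is the hotel", examples)
print(pair, score)
```

A stochastic method would instead score candidate translations with probability models estimated from the corpus, which is one reason the two approaches have complementary strengths.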
At present, however, corpus-based methods can only be applied to narrow domains due to the lack of sufficiently large bilingual spoken language corpora. Therefore, one of our sub-themes is to establish a methodology for gathering large volumes of data to enable us to translate various expressions at high quality. For this sub-theme, we have started to conduct research on several methods, including paraphrasing, for the creation of huge bilingual corpora, and on methods for evaluating the coverage of the collected corpora.
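One simple way to quantify corpus coverage is to measure how many word n-grams of unseen text already occur in the collected corpus. The sketch below assumes whitespace-tokenized sentences and uses bigrams by default; the function names and the sample sentences are hypothetical, and this measure is only one of many possible choices, not necessarily the one we will adopt.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_coverage(
    corpus: Iterable[str], held_out: Iterable[str], n: int = 2
) -> float:
    """Fraction of n-gram occurrences in `held_out` that also appear
    somewhere in `corpus`. Returns 0.0 if `held_out` has no n-grams."""
    seen: Set[Tuple[str, ...]] = set()
    for sentence in corpus:
        seen.update(ngrams(sentence.split(), n))

    total = 0
    covered = 0
    for sentence in held_out:
        for gram in ngrams(sentence.split(), n):
            total += 1
            if gram in seen:
                covered += 1
    return covered / total if total else 0.0


corpus = ["i would like a room", "how much is it"]
held_out = ["i would like a table"]
print(ngram_coverage(corpus, held_out))
```

Tracking such a figure as the corpus grows would indicate whether additional data, for instance from paraphrasing, still contributes new expressions.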
Corpus-based speech synthesis technologies
In corpus-based speech synthesis, a larger-scale speech corpus enables broader phonological and prosodic diversity, and this offers advantages in sound quality. For this reason, trends in recent years have leaned toward expanding the scale of speech corpora. However, increasing corpus size leads to three significant disadvantages: (1) increasing costs for development of speech synthesis systems; (2) the difficulty of gathering the diverse set of speakers needed to create such corpora; and (3) the difficulty of installing a massive corpus in a mobile information device, due to constraints on memory capacity. Therefore, the evaluation and design of speech corpora are important issues in corpus-based speech synthesis. We need to quantitatively clarify the relationship between sound quality of the synthesized speech and the corpus scale, and we need to develop a method of designing speech corpora so that we can determine the necessary content of a speech corpus when a target domain and a corpus scale are given.
Developing a high-quality algorithm for selecting synthesis units is also an important issue in corpus-based speech synthesis. Such an algorithm should be based on exhaustive perceptual experiments that relate selection and concatenation of waveforms to naturalness degradation.
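Unit selection is commonly formulated as a search for the sequence of corpus units that minimizes the sum of a target cost (how well each unit matches the desired specification) and a concatenation cost (how audible each join is). The sketch below assumes this standard formulation, which is not spelled out in the text above, and solves it by dynamic programming. The cost functions are placeholders supplied by the caller; in the setting described here they would be derived from perceptual experiments. All names are hypothetical, and each candidate list is assumed to be non-empty.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target specification for one unit
U = TypeVar("U")  # candidate unit from the speech corpus


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    concat_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per target so that the total of target costs
    and concatenation costs is minimal (Viterbi-style search)."""
    if not targets:
        return []
    # best[i][j]: minimal cumulative cost of a path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for u in candidates[i]:
            prev = candidates[i - 1]
            # Cheapest predecessor, including the cost of the join.
            k = min(
                range(len(prev)),
                key=lambda p: best[i - 1][p] + concat_cost(prev[p], u),
            )
            row_cost.append(
                best[i - 1][k] + concat_cost(prev[k], u) + target_cost(targets[i], u)
            )
            row_back.append(k)
        best.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path: List[U] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path


# Toy example: units are represented only by a pitch value in Hz.
chosen = select_units(
    targets=[100, 100],
    candidates=[[100, 104], [110, 104]],
    target_cost=lambda t, u: abs(t - u),
    concat_cost=lambda a, b: 2 * abs(a - b),
)
print(chosen)
```

In the toy example, the first candidate matches its target exactly, but the search still prefers the pair of units that join smoothly, which shows why the relative weighting of the two costs has to reflect perceived naturalness.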
2. Conclusion
Research at ATR into spoken language translation technologies began in 1986 in the ATR Interpreting Telephony Research Laboratories and continued from 1990 in the ATR Interpreting Telecommunications Research Laboratories. In these successive research programs, we attacked various difficult technical issues and developed state-of-the-art component technologies for spoken language translation, including the invention of corpus-based speech synthesis technologies and significant improvements in example-based translation technologies. As a result, speech translation technologies, once considered a distant dream, can now be applied to useful domains such as hotel reservations, phone inquiries, and restaurant orders. Nevertheless, the three main impediments to real-world application described above still remain. We have begun to address these issues in order to bring speech translation technology into real-world use. We are confident that we can solve these technical problems and make the dream of speech translation a reality.

