NAKAMURA Satoshi
Spoken Language Translation Research Laboratories



1. Introduction
  This paper discusses technology for a heart-to-heart speech interface with robots, a field that has seen notable progress recently. Robots come in many forms, from those used on welding lines in manufacturing plants, to cat-like pets, to humanoids. This paper focuses on the robots most familiar to Japanese people: those given the form of living creatures, including humanoids. Perhaps humans create such robots half expecting them to respond as if they truly were the animals or humans they resemble. In particular, researchers have come to expect that one day they will be able to speak to humanoids, and the humanoids will comprehend their intent and respond accordingly in language and action.
  In reality, with the technology available today, it is almost impossible for robots to respond to speech the way humans do. Robots are being developed, however, that in their own way can see, listen, and move. By maintaining a balance among those functions, it is possible to keep developing the technology needed to draw robots closer to human capabilities, letting them evolve gradually from their current, relatively primitive level to one sophisticated enough for communication.
  Tremendous progress has been made in recent years on actions that humans acquire naturally; biped robots, for example, have been taught to walk. When it comes to using language to communicate between humans and robots, however, existing robots remain at an extremely low level of development. It is unrealistic to expect that the language communication functions of robots will one day suddenly reach the level of human beings. If, however, robots are brought along gradually, as robots, with an overall balance of functions, they can be given a robot-like language communication capability and get along with humans without great inconvenience. No one expects to talk with an infant in the same manner as with an adult, but if robots evolve with their functions developing in parallel, much as an infant develops, the resulting robot will seem a highly natural human companion.

2. Heart-to-Heart Spoken Language Interface
  To understand the sort of interface needed for communicating with a receptive robot, it is first necessary to look at how two human beings express intention when they speak to each other. Figure 1 depicts a typical conversation situation. To make their intentions known, humans speak to each other using their voices. Beyond the meaning of words, humans also convey emotion and emphasis by raising or lowering their pitch, that is, through intonation. Still, words play the most important communication role. As seen in the figure, language is heard and processed through a series of functions: (1) sound is heard (sound retrieval), even from far away and in a noisy environment; (2) the speaker's voice is recognized regardless of the manner of speaking (speech recognition); (3) the speaker's intention is comprehended (intention understanding) from among the various possible expressions used to convey it; (4) an appropriate action is planned in response to the content (action generation); (5) a response is made to what is spoken (dialogue control and speech synthesis); and (6) information in facial expressions and gestures is used to assist in comprehending content (multi-modal integration). Humans also adjust their responses according to the emotions involved (emotion control).



Figure 1. Processing of Robot's Spoken Language Interface.
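  The six functions above form a processing chain from raw sound to a spoken response. The following is a minimal illustrative sketch of that chain; all function names and their toy implementations are hypothetical placeholders, not an actual robot architecture, and stage (6), multi-modal integration, is only noted in a comment.

```python
# Hypothetical sketch of the spoken-language interface pipeline of Figure 1.
# Each stage is a placeholder; a real system would use signal processing,
# ASR, intention understanding, dialogue management, and TTS components.

def retrieve_sound(audio):
    """(1) Sound retrieval: isolate the user's voice from noise."""
    return audio.strip()

def recognize_speech(clean_audio):
    """(2) Speech recognition: transcribe the voice into words."""
    return clean_audio.lower()

def understand_intention(text):
    """(3) Intention understanding: map words to the user's intent."""
    return {"intent": "greet"} if "hello" in text else {"intent": "unknown"}

def generate_action(intent):
    """(4) Action generation: plan a response to the content."""
    return "wave and reply" if intent["intent"] == "greet" else "ask to repeat"

def respond(action):
    """(5) Dialogue control and speech synthesis: produce the reply."""
    return f"Robot: {action}"

def spoken_language_interface(audio):
    """Chain stages (1)-(5); stage (6) would add facial-image input here."""
    return respond(generate_action(understand_intention(
        recognize_speech(retrieve_sound(audio)))))

print(spoken_language_interface("  Hello robot  "))  # Robot: wave and reply
```

The point of the sketch is only the ordering: each stage consumes the previous stage's output, so an error early in the chain (for example, in sound retrieval) degrades everything downstream.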

  Current technology does not allow robots to reach human levels in any of these functions. The difficulties become extreme when a human and a robot must share the same knowledge and background of daily life, where the robot has to understand its user's thinking patterns and social values and respond by comprehending the user's intention from only a few words. In fact, even human beings sometimes have difficulty interacting with each other in such complex situations.

3. Research at Spoken Language Translation Research Laboratories
  Table 1 shows, in simplified form, the current level of robot-related speech interface technologies and their functions. As the table shows, current technology is unfortunately not yet developed enough for robots to recognize spontaneous speech reliably. Researchers, especially in Japan, are now studying humanoids with an emphasis on communication. In fact, Japan leads the world in intelligent robots, and it is no exaggeration to say that all the related technology exists in Japan.
  Researchers at the Spoken Language Translation Research Laboratories continue to study the technology needed for spoken language communication. This technology can indeed be developed into core technology not only for speech translation but also for heart-to-heart communication between humans and robots. The laboratories are currently studying the following seven areas of technology with regard to a heart-to-heart speech interface with robots.

  1. Distant sound retrieval technology: Using microphone arrays and signal processing, efforts are being made to have a robot distinguish a user's voice in a noisy, highly reverberant environment.
  2. Multilingual speech recognition robust to various speaking styles: Recognition of spontaneous dialogue that is robust to various speaking styles; besides Japanese, it will also be possible to recognize speech in languages such as English and Chinese.
  3. Speech detection and recognition integrating sound and image: Facial images improve speech detection performance; furthermore, the addition of lip-reading technology improves speech recognition performance in noisy environments.
  4. Speech understanding technology: This technology helps a robot identify and understand the subject of a user's speech.
  5. Action generation technology: This technology allows a robot to automatically acquire dialogue rules and generate actions; examples are the generation of questions, confirmation of content, and responses using synthesized speech based on a dialogue strategy.
  6. Speech synthesis: This technology aims at highly natural multilingual conversational speech synthesis as an interface action.
  7. Talking-head generation synchronized with speech synthesis: A natural-looking face is generated in synchrony with the synthesized speech.
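  One classical technique behind the microphone-array work in item 1 above is delay-and-sum beamforming: each microphone's signal is shifted by its known arrival delay toward the target speaker and the channels are averaged, so the speech adds coherently while independent noise partially cancels. The following is a toy sketch under simplifying assumptions (known integer-sample delays, two microphones, synthetic signals); real array processing estimates the delays and operates on far more elements.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Delay-and-sum beamforming: align each microphone signal by its
    known integer-sample delay, then average. Coherent speech adds up;
    independent noise partially cancels."""
    n = min(len(s) - d for s, d in zip(signals, delays))
    aligned = np.stack([s[d:d + n] for s, d in zip(signals, delays)])
    return aligned.mean(axis=0)

# Toy two-microphone example: the same source reaches mic 2 two samples
# later; each channel has independent additive noise.
rng = np.random.default_rng(0)
t = np.arange(200)
source = np.sin(2 * np.pi * t / 25)
mic1 = source + 0.5 * rng.standard_normal(200)
mic2 = np.roll(source, 2) + 0.5 * rng.standard_normal(200)

enhanced = delay_and_sum([mic1, mic2], delays=[0, 2])
err_single = np.std(mic1[:len(enhanced)] - source[:len(enhanced)])
err_beam = np.std(enhanced - source[:len(enhanced)])
print(err_beam < err_single)  # averaging two channels reduces the noise
```

With two microphones the residual noise power is roughly halved; the benefit grows with the number of elements, which is why the table below notes that multiple elements can sometimes surpass human sound-retrieval performance.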

  Much further research remains for almost all the technology needed to interface robots with human speech. Besides technology for understanding the content of speech, technology is also needed for generating actions, for more sophisticated speech recognition, and for recognizing emotion. Especially vigorous research is required to realize the mutual understanding between robots and their users needed for heart-to-heart communication.
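  As a concrete illustration of the action generation technology in item 5 of the list above, which generates questions, confirmations, and responses from a dialogue strategy, consider the following minimal rule-based sketch. The intents, the confidence threshold, and the reply strings are invented for illustration; a real system would learn such rules rather than hard-code them.

```python
# Hypothetical rule-based action generation (cf. item 5 in Section 3):
# confirm low-confidence input, respond to known intents, and otherwise
# ask a clarifying question. All names and thresholds are assumptions.

def generate_action(intent, confidence):
    if confidence < 0.5:
        return "confirm", f"Did you say '{intent}'?"   # confirmation of content
    if intent == "fetch_water":
        return "act", "Bringing water."                # response plus action
    return "ask", "What would you like me to do?"      # clarifying question

print(generate_action("fetch_water", 0.9))  # ('act', 'Bringing water.')
print(generate_action("fetch_water", 0.3))  # confirmation request
```

Even this trivial strategy shows why recognition confidence matters for dialogue control: the same recognized intent leads to an action or to a confirmation question depending on how certain the recognizer is.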

Table 1. Current State of Speech Interface Technology.

Item
Comments
Sound retrieval
Research includes noise suppression using adaptive filters and microphone-array signal processing with multiple microphone elements. With multiple elements, sound-retrieval performance can sometimes surpass that of humans.
Speech recognition
If a user speaks clearly, robots can recognize even difficult words. In a noisy environment, however, and with no special effort made to speak clearly, including changes in speaking style, recognition performance deteriorates considerably.
Emotion understanding
Research covers coarse recognition of emotion from rough prosodic information in speech. This technology has not yet reached the level where it can be used in conversational speech recognition.
Speech understanding
Speech understanding is possible only in very limited domains. The understanding of speech and intent required for general robot tasks remains a future challenge.
Gesture generation
・Generation of dialogue: Dialogues are possible in certain specific domains but are difficult to achieve in general domains and in domains that change frequently.
・Speech synthesis: Voice quality has become more natural, but the synthesis of conversational speech for use in dialogues remains a future task.
・Action generation: Actions have been generated only for very specific situations; the management of, say, integrated speech and gestures remains a future task.

4. Conclusion
  What types of robots will society need in the future? Surely, representative robots in need will be those for elderly care, those that act as conversational partners, and those that follow the instructions of their users. Although it would obviously be dangerous to design overly intelligent robots, today's robots are still severely lacking in terms of their interface with humans. The interface between robots and humans is a field of study that requires much further research.