Toward Heart-to-Heart Speech Interface with Robots

1. Introduction
　　This paper discusses technology for a heart-to-heart speech interface with robots, a field where notable progress has been made recently. Robots come in many types, from those used on welding lines in manufacturing plants, to cat-like types, to humanoids. This paper focuses on the robots familiar to most Japanese people: those given the form of living creatures, including humanoids. Perhaps humans create such robots half expecting them to respond as if they truly were the animals or humans they resemble. In particular, researchers have come to expect that one day they will be able to speak to humanoids, and that the humanoids will comprehend their intent and respond accordingly in language and action.
　　In practice, with the technology available today, it is almost impossible to have robots respond to speech the way humans do. Robots are being developed, however, that can in their own way see, listen, and move. It is also possible, while maintaining a balance among those functions, to keep developing the technology needed to bring robots closer to human capabilities, so that they gradually evolve from their current, relatively primitive level to one sophisticated enough to allow them to communicate.
　　Tremendous progress has been made in recent years on certain actions that humans learn automatically. Biped robots, for example, have been taught to walk. When it comes to using language to communicate between humans and robots, however, existing robots are still at an extremely low level of development. Unfortunately, there is almost no hope that the language communication functions of robots will one day suddenly rise to the human level. However, if robots are brought along gradually, as robots, with an overall balance of functions, they will acquire a robot-like language communication capability and will be able to get along with humans without causing any great inconvenience. No one expects to talk with an infant the same way as with an adult; likewise, if robots evolve with their functions developing in parallel, the way an infant develops, the resulting robot will seem a highly natural human companion.

2. Heart-to-Heart Spoken Language Interface
　　To understand the sort of interface needed for communicating with a receptive robot, it is first necessary to look at how two human beings express intention when they speak to each other. Figure 1 depicts a typical conversation. To make their intentions known, humans speak to each other using their voices. Besides conveying meaning through words, humans also convey emotion and emphasis by raising or lowering their pitch (intonation). Still, words play the most important role in communication. As seen in the figure, language is heard and processed in a series of functions: (1) sound is heard (sound retrieval), even from far away and even in a noisy environment; (2) the speaker's voice is recognized regardless of the manner of speaking (speech recognition); (3) the speaker's intention is comprehended (intention understanding) from among the various expressions that could be used to convey it; (4) appropriate action is planned in response to the content (action generation); (5) a response is made to what was spoken (dialogue control; speech synthesis); and (6) information in facial expressions and gestures is used to help comprehend the content (multi-modal integration). Humans also adjust their response according to the emotions involved (emotion control).
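The chain of functions listed above can be pictured as a simple pipeline in which each stage enriches a shared record of the current utterance. The following is only a minimal sketch under loud assumptions: the `Turn` record, every stage function, and the keyword rules are hypothetical placeholders invented for illustration, not the system described in this paper, and multi-modal integration and emotion control are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """State carried through the stages for one user utterance (hypothetical)."""
    audio: str                 # stand-in for a captured waveform
    text: str = ""
    intent: str = ""
    action: str = ""
    reply: str = ""
    trace: List[str] = field(default_factory=list)


def sound_retrieval(turn: Turn) -> Turn:
    # (1) A real system would suppress noise and reverberation; the toy only trims.
    turn.audio = turn.audio.strip()
    turn.trace.append("sound retrieval")
    return turn


def speech_recognition(turn: Turn) -> Turn:
    # (2) Toy recognizer: the audio stand-in already is the transcript.
    turn.text = turn.audio.lower()
    turn.trace.append("speech recognition")
    return turn


def intention_understanding(turn: Turn) -> Turn:
    # (3) Keyword spotting as a placeholder for real intention analysis.
    turn.intent = "greeting" if "hello" in turn.text else "unknown"
    turn.trace.append("intention understanding")
    return turn


def action_generation(turn: Turn) -> Turn:
    # (4) Plan an action that fits the understood intention.
    turn.action = "wave" if turn.intent == "greeting" else "ask_again"
    turn.trace.append("action generation")
    return turn


def dialogue_control(turn: Turn) -> Turn:
    # (5) Choose the reply text that a speech synthesizer would then speak.
    turn.reply = "Hello!" if turn.action == "wave" else "Could you repeat that?"
    turn.trace.append("dialogue control")
    return turn


STAGES: List[Callable[[Turn], Turn]] = [
    sound_retrieval,
    speech_recognition,
    intention_understanding,
    action_generation,
    dialogue_control,
]


def run_turn(audio: str) -> Turn:
    """Pass one utterance through every stage in order."""
    turn = Turn(audio=audio)
    for stage in STAGES:
        turn = stage(turn)
    return turn
```

Calling `run_turn("  Hello, robot  ")` passes the utterance through all five stages in order. Keeping each function behind the same `Turn -> Turn` signature means one stage can be improved without disturbing the others, which fits the idea of developing the functions in balance.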

Figure 1. Processing of Robot's Spoken Language Interface.

Distant sound retrieval technology: Using technology related to microphone arrays and signal processing, efforts are being made to have a robot distinguish a user's voice in a noisy environment with a high level of reverberation.
Multi-lingual speech recognition robust to various speaking styles: Recognition of spontaneous dialogue that is robust to various speaking styles; besides Japanese, it will also be possible to recognize speech in languages such as English and Chinese.
Speech detection and recognition with sound and image interpretation: Facial images improve speech detection performance; furthermore, the addition of lip-reading technology improves speech recognition performance in noisy environments.
Speech understanding technology: This technology helps a robot to identify and understand the subject of a user's speech.
Action generation technology: This technology allows a robot to automatically acquire dialogue rules and to generate action; examples are the generation of questions, confirmation of content, and responses using synthesized speech based on a dialogue strategy.
Speech synthesis: This technology aims at developing highly natural multilingual conversational speech synthesis as an interface action.
Talking head generation technology synchronous to speech synthesis: A natural-looking face is generated in synchrony with the speech sounds produced by speech synthesis.
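To make the microphone-array item in the list above more concrete, here is a minimal sketch of delay-and-sum beamforming, one classical way of combining array channels. The paper does not say which array method is used, so the choice of technique, the function name `delay_and_sum`, and the assumption that arrival delays are already known as whole numbers of samples are all illustrative assumptions.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Average the channels after advancing each one by its arrival delay.

    channels: one sample sequence per microphone.
    delays:   non-negative arrival delay of the target sound at each
              microphone, in samples (assumed known in this sketch).
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Keep only the samples for which every shifted channel still has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    count = len(channels)
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / count
        for i in range(max(length, 0))
    ]
```

After alignment, sound from the target direction adds up coherently across microphones, while noise arriving with other delays tends to average out. A practical system would also have to estimate the delays and handle fractional-sample shifts, which this sketch leaves out.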

　　Much research remains to be done on almost all of the technology needed for interfacing robots with human speech. Besides technology for understanding the content of speech, technology is also needed for generating action, for more sophisticated speech recognition, and for recognizing emotion. More vigorous research is required, especially on achieving the mutual understanding between robots and their users that heart-to-heart communication depends on.

Table 1. Current State of Speech Interface Technology.



NAKAMURA Satoshi, Spoken Language Translation Research Laboratories


Item	Comments
Sound retrieval	Technology related to sound retrieval includes research on suppressing noise with adaptive filters and on microphone array signal processing that uses multiple microphone elements. With multiple elements, sound retrieval performance can sometimes surpass that of humans.
Speech recognition	If a user speaks clearly, robots can recognize even difficult words. In a noisy environment, however, or when the user makes no special effort to speak clearly or changes speaking style, recognition performance deteriorates considerably.
Emotion understanding	Research so far covers only coarse recognition based on rough information about the tone of speech. This technology has not yet reached the level where it can be used in conversational speech recognition.
Speech understanding	Speech recognition is possible only in very limited areas. The understanding of speech and intent required for general robot tasks will be tackled in the future.
Gesture generation	・Generation of dialogue: Dialogues are possible in certain specific domains but are difficult to achieve in general domains and in domains that change frequently. ・Speech synthesis: Voice quality has become more natural, but the synthesis of conversational speech for use in dialogues remains a future task. ・Action generation: Actions have been generated only for very specific situations; managing, say, integrated speech and gestures remains a future task.
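The sound retrieval row of the table mentions suppressing noise with adaptive filters. As an illustration only, the sketch below uses the least-mean-squares (LMS) algorithm, one common adaptive filter, in a noise-cancelling arrangement. The paper does not name a specific algorithm. The function name `lms_cancel`, the existence of a separate noise-only reference microphone, and the tap count and step size are all assumptions.

```python
from typing import List, Sequence, Tuple


def lms_cancel(primary: Sequence[float],
               reference: Sequence[float],
               taps: int = 4,
               mu: float = 0.05) -> Tuple[List[float], List[float]]:
    """Subtract an adaptively filtered noise reference from the primary signal.

    primary:   speech plus noise, as picked up by the main microphone.
    reference: noise-only signal correlated with the noise in `primary`.
    Returns (cleaned samples, final filter weights).
    """
    if len(primary) != len(reference):
        raise ValueError("primary and reference must have the same length")
    weights = [0.0] * taps
    cleaned: List[float] = []
    for k in range(len(primary)):
        # Most recent `taps` reference samples, newest first, zero-padded.
        window = [reference[k - j] if k - j >= 0 else 0.0 for j in range(taps)]
        noise_estimate = sum(w * x for w, x in zip(weights, window))
        error = primary[k] - noise_estimate      # doubles as the cleaned output
        # LMS update: move the weights along the error-times-input direction.
        weights = [w + mu * error * x for w, x in zip(weights, window)]
        cleaned.append(error)
    return cleaned, weights
```

The filter learns how the reference noise leaks into the primary channel and subtracts its estimate, so what remains approximates the speech. The step size `mu` trades convergence speed against stability and would need tuning for real signals.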