HONDA Kiyoshi
Department of Biophysical Imaging
Human Information Science Laboratories



1. Introduction
It is commonly believed that human speech conveys linguistic and para-linguistic information. Additionally, speech sounds carry an elementary but underexplored feature that arises from the human body. This is a biological function of speech, and it plays a critical part in our communication through sounds. One example of this function is speaker characteristics. Why does the voice of each speaker sound so unique? According to current theories, the one-dimensional pattern of the vocal tract shape determines the resonance of a vowel; this, however, offers little insight into the variability of vowel sounds across speakers. Another example is the invariance of speech sounds. Why do adult and juvenile speakers share the same vowel sound? Significant differences in the size and shape of the developing vocal organs result in very different physical properties of vowels, yet the sounds exchanged between speakers fall into the same vowel category. Although we so far have no answers to these questions on the variability and invariance of vowels, the present author believes that the two issues are inter-related because both derive from the unique characteristics of the human body.
 If the questions above are to be resolved, we need to obtain a realistic model of human speech functions. With such a model, we could establish a technological basis for speech synthesis and recognition grounded in human biological functions. With this goal in mind, the author's group continues research into the actual mechanisms of human speech production by employing Magnetic Resonance Imaging (MRI).

2. Magnetic Resonance Imaging (MRI)
Our research focuses on three topics, chosen with the biological aspects of speech in mind: measurement of speech-organ geometry, modeling of speech production mechanisms, and functional brain imaging during speech production. The studies use the MRI equipment installed at the ATR Brain Activity Imaging Center. The results obtained so far are briefly introduced below.

2.1 Measurement of speech-organ geometry
The shape of the vocal tract during speech was traditionally observed by X-ray radiography, but MRI has recently become available for viewing cross-sectional images of the body. In particular, its three-dimensional (3D) imaging is ideal for speech studies, and its high-resolution imaging helps in observing the smaller organs. Figure 1(a) shows the vocal tract for the Japanese vowel /e/ from 3D-MRI data. The real vocal tract thus observed has small and large side branches, which must affect vowel spectra. Figure 1(b) shows a result from high-resolution MRI of the larynx, with which the mechanism of vocal frequency control could be explored. This advancing imaging technique has further enabled 3D motion imaging. Figure 1(c) displays a sample of a time series of vocal tract area functions during an utterance of Japanese /aiueo/. The data promise direct measurement of the cross-sectional area and volume of various parts of the vocal tract [1].
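As a rough illustration of how volume follows from an area function, the sketch below sums cross-sectional area times section length over a sampled vocal tract. The function names, the equal section length, and the sample area values are all hypothetical and chosen only for illustration; they are not measurements from the MRI data described here.

```python
def tract_length(areas_cm2, section_length_cm):
    """Total length (cm) of a tract sampled in equal-length sections."""
    return len(areas_cm2) * section_length_cm


def tract_volume(areas_cm2, section_length_cm):
    """Approximate volume (cm^3) as the sum of area * section length.

    Assumes each section is a short cylinder of constant area, which is
    a simplification of the real 3D shape.
    """
    if section_length_cm <= 0:
        raise ValueError("section length must be positive")
    return sum(area * section_length_cm for area in areas_cm2)


if __name__ == "__main__":
    # Hypothetical area function (cm^2), listed from glottis to lips.
    areas = [0.5, 1.0, 2.0, 3.0, 2.5, 1.5, 1.0, 2.0]
    section = 0.5  # hypothetical section length in cm

    print("length (cm):", tract_length(areas, section))
    print("volume (cm^3):", tract_volume(areas, section))

    # The volume of one part of the tract (for example a pharyngeal
    # portion) can be taken from a slice of the same area function.
    print("first four sections (cm^3):", tract_volume(areas[:4], section))
```

Applying the same function to each frame of a time series of area functions would give a volume trajectory over an utterance, under the same constant-area simplification.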


Fig. 1. Visualization of speech organs by MRI: (a) shape of vocal tract, (b) laryngeal cartilages, (c) time sequence of vocal tract area function.



2.2 Causal factors of speaker characteristics

In the current project, we are taking advantage of MRI to determine the causal factors of speaker characteristics in the vocal tract. Several acoustic studies have revealed that individual differences are found in both the lower and higher frequency regions. Therefore, the corresponding variations must be identified in the vocal tract. The results obtained so far from our MRI measurements of adult male speakers can be summarized as follows.
● The pharyngeal cavity volume is correlated with variation of the lower formants.
● The laryngeal cavity functions as a Helmholtz resonator to modify the fourth formant at 3.5 kHz.
● The piriform fossa provides anti-resonance and causes variation of the spectral pattern in the 4-5 kHz region.
  Since these phenomena are incongruent with current speech production theory, a new model must be introduced. Toward such a model, we propose three hypotheses obtained from our experiments. First, the hypopharyngeal cavities and the vocal tract proper are acoustically independent. Second, not only the length but also the volume of each part of the vocal tract controls the lower vowel formants. Third, the resonance characteristics of the laryngeal cavity and piriform fossa (hypopharyngeal resonance) determine spectral shapes at higher frequencies. Accordingly, we have developed an acoustic model of vowel production as shown in Figure 2. This model can better describe the spectral envelope of speech sounds and can also explain the causal factors of speaker characteristics [2].
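To give a feel for the orders of magnitude involved, the following sketch evaluates two textbook lumped-acoustics formulas: the Helmholtz resonance frequency f = (c / 2π) · sqrt(A / (V · L)) and the first anti-resonance of a closed side branch, f = c / (4 · L). This is not the acoustic model of Figure 2; it is a simplified approximation that ignores end corrections and losses. The speed of sound and all cavity dimensions below are assumed, illustrative values rather than measured ones.

```python
import math

# Assumed speed of sound in warm, humid air (cm/s).
SPEED_OF_SOUND_CM_S = 35000.0


def helmholtz_frequency(neck_area_cm2, neck_length_cm, cavity_volume_cm3,
                        c=SPEED_OF_SOUND_CM_S):
    """Resonance frequency (Hz) of an ideal Helmholtz resonator.

    f = (c / (2*pi)) * sqrt(A / (V * L)), with no end correction.
    """
    if min(neck_area_cm2, neck_length_cm, cavity_volume_cm3) <= 0:
        raise ValueError("all dimensions must be positive")
    ratio = neck_area_cm2 / (cavity_volume_cm3 * neck_length_cm)
    return (c / (2.0 * math.pi)) * math.sqrt(ratio)


def quarter_wave_frequency(branch_length_cm, c=SPEED_OF_SOUND_CM_S):
    """First anti-resonance (Hz) of a side branch closed at its far end."""
    if branch_length_cm <= 0:
        raise ValueError("branch length must be positive")
    return c / (4.0 * branch_length_cm)


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only so that the results fall near
    # the frequency regions mentioned in the text.
    f_cavity = helmholtz_frequency(neck_area_cm2=0.4,
                                   neck_length_cm=2.0,
                                   cavity_volume_cm3=0.5)
    f_branch = quarter_wave_frequency(branch_length_cm=2.0)
    print(f"Helmholtz resonance: {f_cavity:.0f} Hz")
    print(f"Side-branch anti-resonance: {f_branch:.0f} Hz")
```

With these assumed dimensions the two values land near 3.5 kHz and in the 4-5 kHz region respectively, which shows how small changes in cavity size across speakers could shift the higher-frequency spectral pattern.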

Fig. 2. Structure of hypopharynx (top) and acoustic model of vowel production
with hypopharyngeal-cavity coupling (bottom).



Fig. 3. Activity of left anterior insula during speech production.

2.3 Brain activity measurement during speech
Brain function during speech production has been investigated by functional MRI. The view of brain regions involved in speech production has been updated from the classical idea centered on Broca's area: currently more attention is paid to the left anterior insula because this region is selectively involved in the disorder of speech motor programming. However, the functions of the insula are not fully understood, partly because this region is located deep within the lateral sulcus. Following clinical studies, many researchers have conducted functional imaging experiments to show evidence of the involvement of the insula in speech production.
  Our results, as well as those of others, were not always consistent, however; the insula was often silent in repetitive tasks and tended to be more active in non-repetitive tasks. Figure 3 shows an example of data obtained while a new task (a three-syllable word) was presented at each repetition of the utterance. It appears that in repetitive tasks, the processing from sound patterns to motor programs is done only once at the first utterance, while in non-repetitive tasks, it is done at each utterance. In other words, the phonological representation of the syllable sequence is decoded into motor patterns when a new task is presented [3]. Our tentative conclusion is that the decoding of a syllable sequence may be carried out in the left anterior insula.

References
[1] Shimada, Y., Fujimoto, I., Takemoto, H., Takano, S., Honda, K., & Takeo, I. (2002). 4D-MRI using synchronized sampling method (SSM). Japanese Journal of Radiological Technology 58 (12), 1592-1598.
[2] Honda, K., Takemoto, H., Kitamura, T., Fujita, S., & Takano, S. (2004). Exploring human speech production mechanisms by MRI. IEICE Trans. Inf. & Syst., E87-D, 1050-1058.
[3] Nota, Y., & Honda, K. (2003). Possible role of the anterior insula in articulation. Proc. 6th Int. Seminar on Speech Production, 191-194.