Curriculum Vitae

Arai Takayuki

  (荒井 隆行)

Profile Information

Affiliation
Professor, Faculty of Science and Technology, Department of Information and Communication Sciences, Sophia University
Degree
Bachelor of Engineering (Sophia University)
Master of Engineering (Sophia University)
Ph.D. in Engineering (Sophia University)

Contact information
arai@sophia.ac.jp
Researcher number
80266072
J-GLOBAL ID
200901064275514612
researchmap Member ID
1000260131

Research and professional experience:

2008-present Professor at the Department of Information and Communication Sciences,
Sophia University
2006-2008 Professor at the Department of Electrical
and Electronics Engineering, Sophia University
2003-2004 Visiting Scientist at the Research Lab. of Electronics,
Massachusetts Institute of Technology (Cambridge, MA, USA)
2000-2006 Associate Professor at the Department of Electrical
and Electronics Engineering, Sophia University
1998-2000 Assistant Professor at the Department of Electrical
and Electronics Engineering, Sophia University
1997-1998 Research Fellow at the International Computer Science Institute
/ University of California at Berkeley
(Berkeley, California, USA)
1995-1996 Visiting Scientist at the Department of Electrical Engineering,
Oregon Graduate Institute of Science and Technology
(Portland, Oregon, USA)
1994-1995 Research Associate at the Department of Electrical and
Electronics Engineering, Sophia University
working with Professor Yoshida
1992-1993 Visiting Scientist at the Department of Computer Science
and Engineering, Oregon Graduate Institute of Science and Technology
(Portland, Oregon, USA)

Short-term Visiting Scientist:

2000, August / 2001, August / 2002, August
Massachusetts Institute of Technology (Cambridge, Massachusetts, USA)
2001, March
Max Planck Institute for Psycholinguistics (Nijmegen, the Netherlands)

The series of events involved in speech communication is called the “speech chain,” a basic concept in the speech and hearing sciences. Our research focuses on speech communication. The fields of this research are wide-ranging, and our interests include the following interdisciplinary areas:
- education in acoustics (e.g., physical models of human vocal tract),
- acoustic phonetics,
- speech and hearing sciences,
- speech production,
- speech analysis and speech synthesis,
- speech signal processing (e.g., speech enhancement),
- speech / language recognition and spoken language processing,
- speech perception and psychoacoustics,
- acoustics for speech disorders,
- speech processing for the hearing impaired,
- speaker characteristics in speech, and
- real-time signal processing using DSP processors.
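
Several of these areas (speech production, the source-filter theory demonstrated by the vocal-tract models, and speech synthesis) can be illustrated with a minimal source-filter vowel synthesizer. This is a generic sketch, not code from any of the works listed here; the formant frequencies and bandwidths for the /a/-like vowel are illustrative textbook-style assumptions.

```python
import math

def resonator_coeffs(f, bw, fs):
    # Two-pole digital resonator approximating one formant:
    # H(z) = g / (1 - a1*z^-1 - a2*z^-2)
    r = math.exp(-math.pi * bw / fs)
    theta = 2 * math.pi * f / fs
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    g = 1 - a1 - a2  # simple normalization: unity gain at DC
    return g, a1, a2

def apply_resonator(x, f, bw, fs):
    # Direct-form difference equation y[n] = g*x[n] + a1*y[n-1] + a2*y[n-2]
    g, a1, a2 = resonator_coeffs(f, bw, fs)
    y = [0.0, 0.0]
    for n in range(len(x)):
        y.append(g * x[n] + a1 * y[-1] + a2 * y[-2])
    return y[2:]

def synth_vowel(formants, f0=100, fs=16000, dur=0.3):
    # Source: impulse train at the fundamental frequency (a crude glottal source)
    n = int(fs * dur)
    period = int(fs / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter: cascade of formant resonators (the "vocal tract")
    out = source
    for f, bw in formants:
        out = apply_resonator(out, f, bw, fs)
    peak = max(abs(v) for v in out) or 1.0
    return [v / peak for v in out]

# Illustrative /a/-like formant frequencies and bandwidths in Hz (assumed values)
samples = synth_vowel([(730, 90), (1090, 110), (2440, 170)])
```

Cascading two-pole resonators over an impulse-train source is the simplest digital analogue of a glottal source exciting vocal-tract resonances; writing the samples to a WAV file (e.g., with the `wave` module) makes the vowel audible.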

(Subject of research)
General Acoustics and Education in Acoustics (including vocal-tract models)
Acoustic Phonetics, Applied Linguistics
Speech Science (including speech production), Hearing Science (including speech perception), Cognitive Science
Speech Intelligibility, Speech Processing, Speech Enhancement
Assistive Technology related to Acoustics, Speech and Acoustics for Everybody
Speech Processing, Applications related to Acoustics
Speaker Characteristics of Speech

(Proposed theme of joint or funded research)
acoustic signal processing
speech signal processing
auditory signal processing
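
As one concrete example of what speech/acoustic signal processing can involve, here is a toy magnitude spectral-subtraction enhancer, a standard textbook technique and not necessarily the method used in any project above. The signal parameters (frame size, tone and interference frequencies) are assumed purely for illustration, and the noise spectrum is taken as known, which real systems must estimate.

```python
import math, cmath

def dft(x):
    # Naive O(N^2) discrete Fourier transform (fine for a small demo frame)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT; input is the spectrum of a real signal, so keep the real part
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtract(frame, noise_mag, floor=0.01):
    # One frame of magnitude spectral subtraction: subtract the (known)
    # noise magnitude spectrum, keep the noisy phase, apply a spectral floor.
    X = dft(frame)
    out = []
    for k, Xk in enumerate(X):
        mag = max(abs(Xk) - noise_mag[k], floor * abs(Xk))
        out.append(cmath.rect(mag, cmath.phase(Xk)))
    return idft(out)

# Toy demo: a 500 Hz "speech" tone plus a 2000 Hz interfering tone
# (both on exact DFT bins so the example is clean). All values assumed.
N, fs = 256, 8000
tone = [math.sin(2 * math.pi * 500 * n / fs) for n in range(N)]
noise = [0.3 * math.sin(2 * math.pi * 2000 * n / fs + 0.7) for n in range(N)]
noisy = [t + v for t, v in zip(tone, noise)]
noise_mag = [abs(X) for X in dft(noise)]  # noise spectrum assumed known here
clean = spectral_subtract(noisy, noise_mag)
```

In practice the noise magnitude spectrum is estimated from speech-free frames and the processing runs frame by frame with overlap-add; the spectral floor prevents the "musical noise" artifacts that aggressive subtraction would otherwise leave.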


Research History (2)

Papers (602)

  • T. Arai
    Proc. of INTERSPEECH, 171-172, Sep, 2018  Peer-reviewed
  • T. Arai
    Proc. of International Symposium on Applied Phonetics, 1-4, Sep, 2018  Peer-reviewed, Invited
  • T. Arai, E. Osawa, T. Igeta, and N. Hodoshima
    Acoustical Science and Technology, 39(3) 252-255, May, 2018  Peer-reviewed
  • E. Iwagami and T. Arai
    Acoustical Science and Technology, 39(2) 109-118, Mar, 2018  Peer-reviewed
    In this study, two perception experiments were conducted to investigate the misperception of Japanese words with devoiced vowels and/or geminate consonants by young and elderly listeners. In Experiment 1, eight young normal-hearing listeners participated under a white-noise condition; eight elderly listeners participated in Experiment 2. Two types of word sets consisting of combinations of vowels (V = /i, u/) and voiceless consonants (C = /k, t, s/) were used as stimuli. The first word set involved two- or three-mora words, and the second word set had 14 minimal pairs of CVC(:)V, where (:) stands for with or without a geminate consonant. The results of both experiments showed that misperception was great for words with devoiced vowels and even greater for words with geminate consonants. In particular, misperception of consonants containing high-frequency components, such as /shi/ or /shu/, was observed for elderly listeners.
  • J. Moore, J. Shaw, S. Kawahara, and T. Arai
    Acoustical Science and Technology, 39(2) 75-83, Mar, 2018  Peer-reviewed
    This study examines the tongue shapes used by Japanese speakers to produce the English liquids /r/ and /l/. Four native Japanese speakers of varying levels of English acquisition and one North American English speaker were recorded both acoustically and with Electromagnetic Articulography. Seven distinct articulation strategies were identified. Results indicate that the least advanced speaker uses a single articulation strategy for both sounds. Intermediate speakers used a wide range of articulations, while the most advanced non-native speaker relied on a single strategy for each sound.
  • T. Arai
    Proc. of the International Workshop on the History of Speech Communication Research, 55-60, Aug, 2017  Peer-reviewed
  • Takayuki Arai
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, -2017 4028-4029, Aug, 2017  Peer-reviewed
    Our physical models of the human vocal tract successfully demonstrate theories such as the source-filter theory of speech production, mechanisms such as the relationship between vocal-tract configuration and vowel quality, and phenomena such as formant frequency estimation. Earlier models took one of two directions: either simplification, showing only a few target themes, or diversification, simulating human articulation more broadly. In this study, we have designed a static, hybrid model. Each model of this type produces one vowel. However, the model also simulates the human articulators more broadly, including the lips, teeth, and tongue. The sagittal block is enclosed with transparent plates so that the inside of the vocal tract is visible from the outside. We also colored the articulators to make them more easily identified. In testing, we confirmed that the vocal-tract models can produce the target vowel. These models have great potential, with applications not only in acoustics and phonetics education, but also in pronunciation training for language learning and speech therapy in clinical settings.
  • Takayuki Arai
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, -2017 979-983, Aug, 2017  Peer-reviewed
    We have developed two types of mechanical models of the human vocal tract. The first model was designed for the retroflex approximant [r] and the alveolar lateral approximant [l]. It consisted of the main vocal tract and a flapping tongue, where the front half of the tongue can be rotated against the palate. When the tongue is short and rotated approximately 90 degrees, the retroflex approximant [r] is produced. The second model was designed for [b], [m], and [w]. Besides the main vocal tract, this model contains a movable lower lip for lip closure and a nasal cavity with a controllable velopharyngeal port. In the present study, we joined these two mechanical models to form a new model containing the main vocal tract, the flapping tongue, the movable lower lip, and the nasal cavity with the controllable velopharyngeal port. This integrated model now makes it possible to produce consonant sequences. Therefore, we examined the sequence [br] in particular, adjusting the timing of the lip and lingual gestures to produce the best sound. Because the gestures are visually observable from the outside of this model, the timing of the gestures was examined with the use of a high-speed video camera.
  • T. Igeta, S. Hiroya, T. Arai
    Phonetics and Speech Sciences, 9(1) 1-7, Mar, 2017  Peer-reviewed
  • Takayuki Arai, Eri Iwagami, Emi Yanagisawa
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 141(3) EL319-EL325, Mar, 2017  Peer-reviewed
    This study tests the perception of geminate consonants by native speakers of Japanese using audio and visual information. A previous study showed that formant transitions associated with the closing gesture of articulators at the end of a preceding vowel are crucial for perception of stop geminate consonants in Japanese. This study further focuses on visual cues, testing whether seeing the closing gesture affects perception of geminate consonants. Based on a perceptual experiment, it is observed that visual information can compensate for a deficiency in geminate consonant auditory information, such as formant transitions.
  • C. T. J. Hui, C. Watson, T. Arai
    Proc. of Speech Science and Technology Conference, 321-324, Dec, 2016  Peer-reviewed
  • Kanako Tomaru, Takayuki Arai
    Acoustical Science and Technology, 37(6) 303-314, Nov, 2016  Peer-reviewed
    The theory of categorical perception of speech sounds traditionally suggests that speech sound discrimination is conducted based on phonemic labeling, an abstract speech representation that listeners are hypothesized to have. However, recent research has found that the impact of labeling on perception of the English /r/-/l/ contrast may depend on surrounding sound contexts: the effects of phonemic labeling may disappear when the speech sounds to be discriminated are presented in a sentence. The purpose of the present research is to investigate (1) the effects of sound contexts on categorical perception of speech sounds, and (2) the cross-linguistic extensibility of such an effect. The experiments employed a Japanese voiced stop consonant continuum, i.e., /ba/-/da/, and tested discrimination of sounds on the continuum by native speakers of Japanese. Experiment 2 in particular investigated whether sounds on such a continuum are discriminated in accordance with labeling when the sound in question is inserted into a sentence. Through the experiments, cross-linguistic effects of surrounding sound contexts were found, although there may be some exceptional cases. The research proposes reconsidering the role of labeling mediation in speech perception.
  • Takayuki Arai
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5, 1099-1103, Sep, 2016  Peer-reviewed
    As an extension of a series of models we have developed, a mechanical bent vocal-tract model with nasal cavity was proposed for educational and clinical applications, as well as for understanding human speech production. Although our recent studies have focused on flap and approximant sounds, this paper introduced a new model for the consonants [b], [m] and [w]. Because the articulatory gesture of approximants is slow compared to the more rapid movement of plosives, in our [b] and [m] model, the elastic force of a spring is applied to affect the movement of the lower lip block, as was done for flap sounds in our previous studies. The main difference between [b] and [m] is in the velopharyngeal port, which is closed for [b] and open for [m]. In this study, we concluded that 1) a slower manipulation of the lip block is needed for [w], while 2) [b] and [m] require a faster movement, and finally, 3) close-open coordination of the lip and velopharyngeal gestures is important for [m].
  • Kanae Amino, Takayuki Arai, Fumiaki Satoh, Kentaro Nakamura, Akira Nishimura, Sakae Yokoyama
    Acoustical Science and Technology, 37(4) 178-180, Jul, 2016  Peer-reviewed
    'Summer-Holiday Science Square' is a scientific event organized by the National Museum of Nature and Science, Japan, to cultivate the scientific literacy of citizens. Exhibition booths presented by universities, colleges, research societies, and other associations allow visitors to become familiar with science through experiments, handicrafts, and observations. Handicraft workshops are held ten times a day, with groups of up to five people who have picked up numbered tickets. Even children younger than school age are welcome when accompanied by a parent. Participants experience various sound resonance phenomena through experiments in 'Let's experience sound resonance!'. In another exhibit, visitors learn about speech production and the visualization of speech through a series of demonstrations. They first learn the principles of vowel production using vocal-tract models. Then they see spectrograms of their own speech on-screen using speech analysis software.
  • Takayuki Arai, Rei Uchida
    Acoustical Science and Technology, 37(4) 175-177, Jul, 2016  Peer-reviewed
  • Takayuki Arai, Takashi Arai
    Acoustical Science and Technology, 37(4) 173-174, Jul, 2016  Peer-reviewed
  • Takayuki Arai
    Acoustical Science and Technology, 37(4) 148-156, Jul, 2016  Peer-reviewed, Invited
    This paper describes vocal-tract models that we have developed and their applications in education in acoustics. First, we grouped the representative models into two major categories depending on their configuration: straight vs. bent. Then, within each category, we discussed the characteristics of each model in terms of its degrees of freedom of movement. Subsequently, we reviewed lectures using the vocal-tract models and reported the results of tests and questionnaires carried out alongside the lectures, some of which were re-evaluated in this paper. On the basis of this review, we further discussed how education should be carried out using the vocal-tract models, and we reached the following conclusions: 1) the models are useful not only for education on sound itself but also for phonetic education; 2) it is important that appropriate models be selected depending on the specific purpose; and 3) it is necessary to continuously develop more models with different properties and wider variations in the future.
  • Mako Ishida, Arthur G. Samuel, Takayuki Arai
    COGNITION, 151 68-75, Jun, 2016  Peer-reviewed
    People can understand speech under poor conditions, even when successive pieces of the waveform are flipped in time. Using a new method to measure perception of such stimuli, we show that words with sounds based on rapid spectral changes (stop consonants) are much more impaired by reversing speech segments than words with fewer such sounds, and that words are much more resistant to disruption than pseudowords. We then demonstrate that this lexical advantage is more characteristic of some people than others. Participants listened to speech that was degraded in two very different ways, and we measured each person's reliance on lexical support for each task. Listeners who relied on the lexicon for help in perceiving one kind of degraded speech also relied on the lexicon when dealing with a quite different kind of degraded speech. Thus, people differ in their relative reliance on the speech signal versus their pre-existing knowledge.
  • Mako Ishida, Takayuki Arai
    SPRINGERPLUS, 5(1), Jun, 2016  Peer-reviewed
    This study investigates how similarly present and absent English phonemes behind noise are perceived by native and non-native speakers. Participants were English native speakers and Japanese native speakers who spoke English as a second language. They listened to English words and non-words in which a phoneme was covered by noise (added; phoneme + noise) or replaced by noise (replaced; noise only). The target phoneme was either a nasal (/m/ and /n/) or a liquid (/l/ and /r/). In the experiment, participants listened to a pair consisting of a word (or non-word) with noise (added or replaced) and a word (or non-word) without noise (original), presented in sequence, and evaluated the similarity of the two on an eight-point scale (8: very similar, 1: not similar). The results suggested that both native and non-native speakers perceived the 'added' phoneme as more similar to the original sound than the 'replaced' phoneme. In addition, both native and non-native speakers restored missing nasals more than missing liquids. In general, a replaced phoneme was better restored in words than in non-words by native speakers, but equally restored by non-native speakers. It seems that bottom-up acoustic cues and top-down lexical cues are adopted differently in the phonemic restoration of native and non-native speakers.
  • N. Hodoshima, T. Arai, K. Kurisu
    Proc. of the Western Pacific Acoustics Conference (WESPAC), 93-97, Dec, 2015  
  • H. Kawata, T. Arai, K. Yasu, K. Kobayashi, M. Shindo
    Journal of the Acoustical Society of Japan, 71(12) 653-600, Dec, 2015  
  • M. Kasuya, T. Arai
    Journal of the Acoustical Society of Japan, 71(1) 7-13, Dec, 2015  
  • T. Kimura, T. Arai
    Proceedings of the Annual Convention of the Phonetic Society of Japan, 80-85, Oct, 2015  
  • E. Yanagisawa, T. Arai
    Journal of the Acoustical Society of Japan, 71(10) 505-515, Oct, 2015  
  • T. Arai, T. Kimura, R. Uchida
    Proc. Autumn Meet. Acoust. Soc. Jpn, 345-348, Sep, 2015  
  • A. Nakajima, T. Arai
    Proc. Autumn Meet. Acoust. Soc. Jpn, 1033-1036, Sep, 2015  
  • Y. Amimoto, T. Arai
    Proc. Autumn Meet. Acoust. Soc. Jpn, 491-494, Sep, 2015  
  • J. Moore, T. Arai
    Proc. Autumn Meet. Acoust. Soc. Jpn, 259-260, Sep, 2015  
  • N. Hodoshima, T. Arai, K. Kurisu
    Proc. Autumn Meet. Acoust. Soc. Jpn, 415-416, Sep, 2015  
  • K. Tomaru, Takayuki Arai
    Proc. Autumn Meet. Acoust. Soc. Jpn, 323-326, Sep, 2015  
  • M. Ishida, T. Arai
    Proc. of INTERSPEECH, 3408-3411, Sep, 2015  
  • T. Sannoh, T. Arai, K. Yasu
    Journal of the Acoustical Society of Japan, 71(8) 382-389, Aug, 2015  
  • T. Arai
    Annual Meeting of Sophia University Linguistic Society (Invited Talk), Jul, 2015  Invited
  • N. Hodoshima, T. Arai, K. Kurisu
    15(5) 443-447, Jul, 2015  
  • Maria Chait, Steven Greenberg, Takayuki Arai, Jonathan Z. Simon, David Poeppel
    FRONTIERS IN NEUROSCIENCE, 9(214) 1-10, Jun, 2015  
    How speech signals are analyzed and represented remains a foundational challenge for both cognitive science and neuroscience. A growing body of research, employing various behavioral and neurobiological experimental techniques, now points to the perceptual relevance of both phoneme-sized (10-40 Hz modulation frequency) and syllable-sized (2-10 Hz modulation frequency) units in speech processing. However, it is not clear how information associated with such different time scales interacts in a manner relevant for speech perception. We report behavioral experiments on speech intelligibility employing a stimulus that allows us to investigate how distinct temporal modulations in speech are treated separately and whether they are combined. We created sentences in which the slow (~4 Hz; Slow) and rapid (~33 Hz; Shigh) modulations, corresponding to ~250 and ~30 ms (the average duration of syllables and certain phonetic properties, respectively), were selectively extracted. Although Slow and Shigh have low intelligibility when presented separately, dichotic presentation of Shigh with Slow results in supra-additive performance, suggesting a synergistic relationship between low- and high-modulation frequencies. A second experiment desynchronized presentation of the Slow and Shigh signals. Desynchronizing the signals relative to one another had no impact on intelligibility when delays were less than 45 ms. Longer delays resulted in a steep intelligibility decline, providing further evidence of integration or binding of information within restricted temporal windows. Our data suggest that human speech perception uses multi-time-resolution processing. Signals are concurrently analyzed on at least two separate time scales, the intermediate representations of these analyses are integrated, and the resulting bound percept has significant consequences for speech intelligibility, a view compatible with recent insights from neuroscience implicating multi-timescale auditory processing.
  • H. Kawata, T. Arai, K. Yasu, E. Yanagisawa
    1313-1316, May, 2015  
  • T. Arai
    Proc. Spring Meet. Acoust. Soc. Jpn, 1375-1378, May, 2015  Invited
  • T. Arai, M. Budhiantho
    Proc. Spring Meet. Acoust. Soc. Jpn, 1291-1294, May, 2015  
  • A. Watanabe, H. Yoshihata, M. Shindo, T. Arai
    Proceedings of the Annual Conference of the Japanese Association of Communication Disorders, 64-64, May, 2015  
  • Takayuki Arai
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2605-2606, 2015  
    The physical model designed by Umeda and Teranishi to simulate an arbitrary shape of the human vocal tract was a straight tube with a set of plastic plates inserted from one side. Although this model has the advantage that users can configure any shape of the vocal tract, manually manipulating several plates simultaneously is difficult. In this study, we present two models extending Umeda and Teranishi's work to overcome this disadvantage. The first model has a straight tube similar to Umeda and Teranishi's original model, but the weight of the plates enables them to return to their resting positions automatically. The second model has a bent tube with the oral and pharyngeal cavities connected at 90 degrees. This feature simulates the actual human vocal tract. The plates move back to their original positions by means of spring coils. In both cases, the plates' automatic return movement facilitates manual manipulation compared to Umeda and Teranishi's original model.
  • Takayuki Arai
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 1695-1699, 2015  
    We applied a physical model of the human vocal tract, originally designed for simulating English /r/, and tested whether the model can produce a certain range of vowels, especially mid front vowels. We first confirmed that the model can produce such vowels with high intelligibility. By changing the tongue height of the model, learners can adjust the vowel quality by listening to the output sounds as well as receiving tactile sensation. We therefore further used the model for pronunciation training as a hands-on tool for phonetic education, based on the consideration that tongue and finger movements are related in terms of motor control. We demonstrated vowel production using the model and received feedback from a group of listeners engaged in phonetic education. The synergistic effect of visual, auditory, and tactile sensations was pointed out as an advantage. We then conducted a production experiment in which participants were asked to repeat each vowel they heard and produce that vowel by manipulating the vocal-tract model. As a result, slight training effects were observed when using the physical model. Specifically, formant frequencies approached target frequencies as the experimental session progressed.
  • Hinako Masuda, Takayuki Arai, Shigeto Kawahara
    Acoustical Science and Technology, 36(1) 31-34, 2015  
    A study was conducted to establish the correlation between English proficiency and the ability of Japanese listeners to identify English consonants in intervocalic contexts in multi-speaker babble noise. TOEIC scores and the number of months spent living in English-speaking countries were used to measure the Japanese participants' English proficiency. TOEIC was chosen because it is widely used in Japanese universities to evaluate students' English proficiency. The number of months spent in English-speaking countries was used under the assumption that listeners with longer residence in English-speaking countries were more accustomed to, and in greater need of, perceiving English speech under more or less adverse listening conditions.
  • Keiichi Yasu, Takayuki Arai, Tatsunori Endoh, Kei Kobayashi, Mitsuko Shindo
    Acoustical Science and Technology, 36(1) 35-38, 2015  
    A study was conducted to investigate the relationship between the broadening of auditory filter bandwidth near 2 kHz and the discrimination/identification of the monosyllables /de/ and /ge/ by young and elderly Japanese listeners. Twenty elderly listeners aged 63 to 80 years and nine young listeners aged 22 to 25 years served as participants. All were native speakers of Japanese who listened to stimuli presented to whichever ear (right or left) had the lower pure-tone threshold at 2 kHz. Pure-tone thresholds for both ears were measured with an audiometer before the experiment. The auditory filter bandwidth at 2 kHz was also measured using an auditory filter measurement system.

Misc. (72)

Works (11)

Research Projects (37)

Academic Activities (1)

Social Activities (1)

Other (55)

  • Apr, 2006 - Jun, 2008
    In a course on giving presentations in English, students' presentations are videotaped so that they can later watch the recordings and evaluate themselves objectively. Students are also asked to give a second presentation on the same content, encouraging them to make improvements.
  • 2003 - Jun, 2008
    Served as a member of the committee on education in acoustics, organizing education sessions (e.g., the education session at the joint meeting of the Acoustical Society of America and the Acoustical Society of Japan held in December 2006).
  • 2003 - Jun, 2008
    Served as a member of the committee on education in acoustics, organizing education sessions (e.g., the education session at the International Congress on Acoustics held in April 2004). In particular, appointed committee chair in 2005 and has been active in that role (e.g., holding a science classroom event at the national museum in October 2006).
  • Apr, 2002 - Jun, 2008
    Since joining the university, has compiled and published a "Progress Report" documenting the laboratory's education and research activities. This has also helped raise the awareness of students in the laboratory and has proven effective.
  • Apr, 2002 - Jun, 2008
    Believing that it is important to be accustomed to using English on a regular basis, the laboratory regularly holds its meetings in English. In addition, since the 2006 academic year, the progress reports given by each research group have also been required to be in English.