Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data.

Author: Cao B; Speech Disorders & Technology Lab, Department of Bioengineering, University of Texas at Dallas, Richardson, Texas, United States., Kim M; Speech Disorders & Technology Lab, Department of Bioengineering, University of Texas at Dallas, Richardson, Texas, United States., Mau T; Department of Otolaryngology - Head and Neck Surgery, University of Texas Southwestern Medical Center, Dallas, Texas, United States., Wang J; Speech Disorders & Technology Lab, Department of Bioengineering, University of Texas at Dallas, Richardson, Texas, United States.; Callier Center for Communication Disorders, University of Texas at Dallas, Richardson, Texas, United States.
Language: English
Source: Workshop on Speech and Language Processing for Assistive Technologies [Workshop Speech Lang Process Assist Technol] 2016 Sep; Vol. 2016, pp. 80-86.
DOI: 10.21437/SLPAT.2016-14
Abstract: Individuals with an impaired larynx (vocal folds) have difficulty controlling glottal vibration and produce whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because its spectral characteristics are distorted. Articulatory cues such as tongue and lip motion may help in recognizing whispered speech, since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with a reconstructed larynx using articulatory movement data. A data set containing both acoustic and articulatory motion data was collected from a patient with a surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed that adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficients (MFCCs) significantly reduced the phone error rates of both speech recognition systems. Adding both tongue and lip data achieved the best performance.
Database: MEDLINE
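
The abstract describes fusing acoustic features (MFCCs) with time-aligned articulatory movement data before acoustic modeling. The sketch below is not the authors' code; it only illustrates that fusion step under assumed values (16 kHz audio, a 10 ms frame shift, 13 MFCCs, and four EMA sensors sampled at 100 Hz), using synthetic placeholder signals in place of the recorded data.

```python
# Illustrative sketch of acoustic + articulatory feature fusion (assumptions,
# not the paper's implementation): MFCC frames are concatenated with
# articulatory (tongue/lip) samples interpolated onto the same frame times,
# producing per-frame feature vectors for a GMM-HMM or DNN-HMM recognizer.
import numpy as np
import librosa

sr = 16000                           # assumed audio sampling rate (Hz)
audio = np.random.randn(sr * 2)      # placeholder for 2 s of whispered speech

# 13-dimensional MFCCs at a 10 ms frame shift (a typical ASR front end).
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            hop_length=160).T            # (n_frames, 13)

# Placeholder articulatory stream: x/y/z positions of 4 hypothetical EMA
# sensors (e.g., tongue tip, tongue body, upper lip, lower lip) at 100 Hz.
ema_rate = 100
ema = np.random.randn(2 * ema_rate, 12)                  # (n_samples, 12)

# Interpolate the articulatory stream onto the MFCC frame times so both
# feature streams share a common frame index.
frame_times = np.arange(mfcc.shape[0]) * 160 / sr
ema_times = np.arange(ema.shape[0]) / ema_rate
ema_aligned = np.stack(
    [np.interp(frame_times, ema_times, ema[:, d]) for d in range(ema.shape[1])],
    axis=1)                                               # (n_frames, 12)

# Fused feature vectors: 13 acoustic + 12 articulatory dimensions per frame.
features = np.concatenate([mfcc, ema_aligned], axis=1)   # (n_frames, 25)
print(features.shape)
```

In such a pipeline, the fused per-frame vectors would be fed to the acoustic model in place of MFCCs alone; the sensor set and dimensionalities above are placeholders rather than the paper's actual configuration.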