On the role of subglottal acoustics in height estimation, and speech and speaker recognition

Author: Arsikere, Harish
Language: English
Publication year: 2014
Source: Arsikere, Harish. (2014). On the role of subglottal acoustics in height estimation, and speech and speaker recognition. UCLA: Electrical Engineering 0303. Retrieved from: http://www.escholarship.org/uc/item/2fz2q7s8
Description: The subglottal system comprises the trachea, the bronchi and their accompanying airways. Unlike that of the supraglottal vocal tract, its configuration changes very little during speech, so its acoustic properties are relatively stationary and speaker-specific. In this dissertation, our knowledge of subglottal acoustics (most importantly, subglottal resonances, or SGRs) is leveraged to develop novel solutions to three problems that involve using or estimating speaker-specific characteristics: (1) body-height estimation, (2) speaker normalization for automatic speech recognition (ASR), and (3) speaker identification (SID) and verification (SV). The focus is on scenarios where purely statistical methods may be sub-optimal owing to limited and/or noisy speech data.

Simultaneous recordings of speech and subglottal acoustics are collected (using a microphone and an accelerometer, respectively) from native American English speakers (50 adults and 43 children) and 6 adult bilingual speakers of Mexican Spanish (first language) and American English. The data are analyzed to understand the relationships of SGRs to vocal-tract resonances (formants), body height and native language. Results indicate that (1) phonological vowel features (tongue height and backness) can be characterized via acoustic measures of formants and SGRs, (2) SGRs correlate well with body height, and (3) SGRs are practically independent of language and phonetic content. Based on these findings, algorithms are developed for the automatic estimation of SGRs from speech signals (i.e., without using accelerometer data). The algorithms are effective for both adults and children, in quiet as well as noisy environments; their performance is equally good for native English and bilingual English/Spanish speakers, and degrades little with limited data.

Predictive models relating body height to SGRs, in conjunction with the SGR-estimation algorithms, are used to develop an automatic approach to speech-based height estimation for adult speakers. The method is comparable in performance to existing data-driven techniques, but requires less training data, generalizes better, and is more robust to noise. In the context of ASR for children, SGRs are used for speaker normalization via piecewise-linear frequency warping. On a digit-recognition task, the method achieves lower word error rates than conventional vocal-tract length normalization, in clean as well as noisy environments; the benefit is particularly significant for young speakers (6-8 years old) and short utterances (1 or 2 words). For SID and SV with adults' speech, an algorithm is developed for deriving subglottal features that are more informative than SGRs with regard to speaker discriminability. When combined with Mel-frequency cepstral coefficients (the conventional speech features for SID and SV), the subglottal features provide significant performance improvements, especially for short test utterances (5-10 seconds in duration).
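
To make the height-estimation idea concrete, the following is a minimal sketch of one plausible predictive model: an ordinary least-squares linear regression from a single estimated SGR (in Hz) to body height (in cm). The model form, the function names and all numbers below are illustrative assumptions, not the dissertation's actual model or data.

    # Illustrative sketch only: a least-squares linear fit of body height on a
    # single subglottal resonance (SGR). Coefficients and data are made up and
    # do not come from the dissertation.
    import numpy as np

    def fit_height_model(sgr_hz, height_cm):
        # Solve height ~ slope * SGR + intercept by ordinary least squares.
        A = np.column_stack([sgr_hz, np.ones_like(sgr_hz)])
        (slope, intercept), *_ = np.linalg.lstsq(A, height_cm, rcond=None)
        return slope, intercept

    def predict_height(model, sgr_hz):
        slope, intercept = model
        return slope * np.asarray(sgr_hz, dtype=float) + intercept

    # Hypothetical usage with fabricated (SGR in Hz, height in cm) pairs.
    train_sgr = np.array([1350.0, 1400.0, 1450.0, 1500.0])
    train_height = np.array([185.0, 178.0, 172.0, 166.0])
    model = fit_height_model(train_sgr, train_height)
    print(predict_height(model, [1420.0]))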
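
The speaker-normalization idea can be illustrated in a similar spirit with a piecewise-linear frequency warp. The sketch below assumes a single breakpoint that maps a speaker's estimated SGR to a reference SGR, with linear segments below and above the breakpoint up to the Nyquist frequency; the breakpoints, reference values and exact warping recipe used in the dissertation may differ.

    # Minimal sketch of a piecewise-linear frequency warp with one breakpoint.
    # The choice of breakpoint (a single SGR), the reference value and the
    # Nyquist default are illustrative assumptions, not the dissertation's
    # exact procedure.
    import numpy as np

    def piecewise_linear_warp(freq_hz, speaker_sgr, reference_sgr, nyquist_hz=8000.0):
        # Map [0, speaker_sgr] onto [0, reference_sgr] and
        # [speaker_sgr, nyquist] onto [reference_sgr, nyquist].
        f = np.asarray(freq_hz, dtype=float)
        warped = np.empty_like(f)
        below = f <= speaker_sgr
        warped[below] = f[below] * (reference_sgr / speaker_sgr)
        slope = (nyquist_hz - reference_sgr) / (nyquist_hz - speaker_sgr)
        warped[~below] = reference_sgr + (f[~below] - speaker_sgr) * slope
        return warped

    # Hypothetical usage: warp mel-filterbank center frequencies before
    # feature extraction for a child speaker whose SGR exceeds the reference.
    centers = np.linspace(100.0, 7800.0, 20)
    print(piecewise_linear_warp(centers, speaker_sgr=1550.0, reference_sgr=1400.0))

In practice, the speaker's SGR value fed to such a warp would come from the automatic SGR-estimation algorithms described in the abstract rather than from accelerometer measurements.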
Database: OpenAIRE