CNN-based phoneme classifier from vocal tract MRI learns embedding consistent with articulatory topology
Author: | van Leeuwen, K. G., Bos, P., Trebeschi, S., van Alphen, M. J. A., Voskuilen, L., Smeele, L. E., van der Heijden, F., van Son, R. J. J. H. |
Contributors: | Maxillofacial Surgery (AMC), Technical Medicine, Robotics and Mechatronics, MKA AMC (OII, ACTA), ACLC (FGw), Oral and Maxillofacial Surgery |
Language: | English |
Year of publication: | 2019 |
Subject: | Computer science; Deep learning; American English; SDG 16 - Peace, Justice and Strong Institutions; Topology; Convolutional neural networks (CNN); Vowel diagram; Articulatory-to-acoustic mapping; Speech analysis; Artificial intelligence; Magnetic resonance imaging (MRI); Vocal tract |
Source: | van Leeuwen, K. G., Bos, P., Trebeschi, S., van Alphen, M. J. A., Voskuilen, L., Smeele, L. E., van der Heijden, F. & van Son, R. J. J. H. (2019). 'CNN-based phoneme classifier from vocal tract MRI learns embedding consistent with articulatory topology'. In Proc. Interspeech 2019 (20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language), pp. 909-913. https://doi.org/10.21437/Interspeech.2019-1173 |
ISSN: | 1990-9772 |
DOI: | 10.21437/Interspeech.2019-1173 |
Description: | Recent advances in real-time magnetic resonance imaging (rtMRI) of the vocal tract provide opportunities for studying human speech. This modality, together with simultaneously acquired speech, may enable the mapping of articulatory configurations to acoustic features. In this study, we take the first step by training a deep learning model to classify 27 different phonemes from midsagittal MR images of the vocal tract. An American English database of 17 subjects was used to train a convolutional neural network for classifying vowels (13 classes), consonants (14 classes), and all phonemes (27 classes). Top-1 classification accuracy on the test set for all phonemes was 57%. Error analysis showed that voiced and unvoiced sounds were often confused. Moreover, we performed principal component analysis on the network's embedding and observed topological similarities between the network's learned representation and the vowel diagram. Saliency maps gave insight into the anatomical regions most important for classification and showed congruence with known regions of articulatory importance. We demonstrate the feasibility of using deep learning to distinguish between phonemes from MRI. Network analysis can be used to improve understanding of normal articulation and speech and, in the future, impaired speech. This study brings us a step closer to articulatory-to-acoustic mapping from rtMRI. |
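The embedding analysis described above (principal component analysis of the classifier's penultimate-layer activations, projected to two dimensions for comparison with the vowel diagram) can be sketched as follows. This is a minimal illustration, not the authors' code; the embedding dimension (64) and the random data standing in for per-phoneme activations are hypothetical.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project embedding vectors onto their top principal components.

    embeddings: (n_samples, dim) array, e.g. one penultimate-layer
    activation vector per phoneme class. Returns (n_samples,
    n_components) coordinates for a 2-D map like the vowel diagram.
    """
    # Center the data; SVD of the centered matrix yields the
    # principal axes (rows of Vt), ordered by explained variance.
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Hypothetical example: 27 phoneme embeddings of dimension 64.
rng = np.random.default_rng(0)
emb = rng.normal(size=(27, 64))
coords = pca_project(emb)  # (27, 2) coordinates, one point per phoneme
```

Plotting `coords` with phoneme labels would give the kind of 2-D layout the paper compares against the articulatory vowel diagram.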
Database: | OpenAIRE |
External link: |