Automatic audiovisual synchronisation for ultrasound tongue imaging

Authors: Eleanor Sugden, Manuel Sam Ribeiro, Korin Richmond, Aciel Eshky, Joanne Cleland, Steve Renals
Language: English
Year of publication: 2021
Subjects:
FOS: Computer and information sciences
Linguistics and Language
Speech production
Computer Science - Machine Learning
Sound (cs.SD)
ultrasound tongue imaging
Computer science
Speech recognition
synchronisation error tolerance
02 engineering and technology
01 natural sciences
Language and Linguistics
Computer Science - Sound
Machine Learning (cs.LG)
Audio and Speech Processing (eess.AS)
0103 physical sciences
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
010301 acoustics
Data collection
Modalities
Computer Science - Computation and Language
automatic audiovisual synchronisation
Artificial neural network
Communication
Image and Video Processing (eess.IV)
Usability
Phonetics
Electrical Engineering and Systems Science - Image and Video Processing
Computer Science Applications
Modeling and Simulation
020201 artificial intelligence & image processing
Computer Vision and Pattern Recognition
Error detection and correction
Computation and Language (cs.CL)
Software
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: Eshky, A, Cleland, J, Ribeiro, M S, Sugden, E, Richmond, K & Renals, S 2021, 'Automatic audiovisual synchronisation for ultrasound tongue imaging', Speech Communication, vol. 132, pp. 83-95. https://doi.org/10.1016/j.specom.2021.05.008
ISSN: 0167-6393
Description: Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and the two modalities must be correctly synchronised for the data to be usable. Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice, resulting in data of limited usability. In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection. We first investigate the tolerance of expert ultrasound users to synchronisation errors in order to find the thresholds for error detection. We use these thresholds to define accuracy scoring boundaries for evaluating our system. We then describe our approach for automatic synchronisation, which is driven by a self-supervised neural network that exploits the correlation between the two signals to synchronise them. We train our model on data from multiple domains with different speaker characteristics, different equipment, and different recording environments, and achieve an accuracy >92.4% on held-out in-domain data. Finally, we introduce a novel resource, the Cleft dataset, which we gathered with a new clinical subgroup and for which hardware synchronisation proved unreliable. We apply our model to this out-of-domain data and evaluate its performance subjectively with expert users. Results show that users prefer our model's output over the original hardware output 79.3% of the time. Our results demonstrate the strength of our approach and its ability to generalise to data from new domains.
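The description above characterises the method only at a high level: a self-supervised two-stream network that learns the correlation between audio and ultrasound, in the spirit of SyncNet-style synchronisers. The sketch below is a minimal, hypothetical PyTorch illustration of that family of models, not the authors' implementation; all class names, feature choices, architectures, and hyper-parameters are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Embed a short window of audio features (e.g. MFCCs) as a unit vector."""

    def __init__(self, n_features: int = 20, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_audio_frames)
        h = self.conv(x).squeeze(-1)  # (batch, 64)
        return F.normalize(self.fc(h), dim=-1)


class UltrasoundEncoder(nn.Module):
    """Embed a short stack of ultrasound frames as a unit vector."""

    def __init__(self, n_frames: int = 5, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over the image plane
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, height, width) -- frames stacked as channels
        h = self.conv(x).flatten(1)  # (batch, 64)
        return F.normalize(self.fc(h), dim=-1)


def contrastive_loss(a: torch.Tensor, u: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """label = 1 for in-sync audio/ultrasound pairs, 0 for shifted pairs.

    Labels come from artificially shifting one modality against the other,
    not from human annotation -- this is what makes the training
    self-supervised.
    """
    d = F.pairwise_distance(a, u)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()


def estimate_offset(audio_emb: torch.Tensor, ultra_emb: torch.Tensor,
                    max_shift: int = 30) -> int:
    """Return the shift (in windows) minimising mean embedding distance.

    audio_emb, ultra_emb: (n_windows, embed_dim), one row per time window.
    """
    best_shift, best_dist = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, u = audio_emb[shift:], ultra_emb[:len(ultra_emb) - shift]
        else:
            a, u = audio_emb[:shift], ultra_emb[-shift:]
        if len(a) == 0:
            continue
        dist = F.pairwise_distance(a, u).mean().item()
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift
```

In this style of model, the two encoders are trained jointly with the contrastive loss on pairs sampled at true and deliberately offset alignments; at test time, the offset that minimises the embedding distance over an utterance is taken as the synchronisation correction.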
18 pages, 10 figures. Manuscript accepted at Speech Communication.
Database: OpenAIRE