Automatic audiovisual synchronisation for ultrasound tongue imaging

Authors: Eleanor Sugden, Manuel Sam Ribeiro, Korin Richmond, Aciel Eshky, Joanne Cleland, Steve Renals
Language: English
Year of publication: 2021
Subjects:
FOS: Computer and information sciences
Linguistics and Language
Speech production
Computer Science - Machine Learning
Sound (cs.SD)
ultrasound tongue imaging
Computer science
Speech recognition
synchronisation error tolerance
02 engineering and technology
01 natural sciences
Language and Linguistics
Computer Science - Sound
Machine Learning (cs.LG)
Audio and Speech Processing (eess.AS)
0103 physical sciences
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
010301 acoustics
Data collection
Modalities
Computer Science - Computation and Language
automatic audiovisual synchronisation
Artificial neural network
Communication
Image and Video Processing (eess.IV)
Usability
Phonetics
Electrical Engineering and Systems Science - Image and Video Processing
Computer Science Applications
Modeling and Simulation
020201 artificial intelligence & image processing
Computer Vision and Pattern Recognition
Error detection and correction
Computation and Language (cs.CL)
Software
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: Eshky, A, Cleland, J, Ribeiro, M S, Sugden, E, Richmond, K & Renals, S 2021, 'Automatic audiovisual synchronisation for ultrasound tongue imaging', Speech Communication, vol. 132, pp. 83-95. https://doi.org/10.1016/j.specom.2021.05.008
ISSN: 0167-6393
Description: Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and the two modalities must be correctly synchronised for the data to be usable. Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice, resulting in data of limited usability. In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection. We first investigate the tolerance of expert ultrasound users to synchronisation errors in order to find the thresholds for error detection. We use these thresholds to define accuracy scoring boundaries for evaluating our system. We then describe our approach for automatic synchronisation, which is driven by a self-supervised neural network that exploits the correlation between the two signals to synchronise them. We train our model on data from multiple domains with different speaker characteristics, different equipment, and different recording environments, and achieve an accuracy >92.4% on held-out in-domain data. Finally, we introduce a novel resource, the Cleft dataset, which we gathered with a new clinical subgroup and for which hardware synchronisation proved unreliable. We apply our model to this out-of-domain data and evaluate its performance subjectively with expert users. Results show that users prefer our model's output over the original hardware output 79.3% of the time. Our results demonstrate the strength of our approach and its ability to generalise to data from new domains.
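The description above characterises the method only at a high level: a self-supervised two-stream network that learns the correlation between audio and ultrasound, in the spirit of SyncNet-style synchronisers. The sketch below is a minimal, hypothetical PyTorch illustration of that family of models, not the authors' implementation; all class names, feature choices, architectures, and hyper-parameters are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Embed a short window of audio features (e.g. MFCCs) as a unit vector."""

    def __init__(self, n_features: int = 20, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_audio_frames)
        h = self.conv(x).squeeze(-1)  # (batch, 64)
        return F.normalize(self.fc(h), dim=-1)


class UltrasoundEncoder(nn.Module):
    """Embed a short stack of ultrasound frames as a unit vector."""

    def __init__(self, n_frames: int = 5, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over the image plane
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, height, width) -- frames stacked as channels
        h = self.conv(x).flatten(1)  # (batch, 64)
        return F.normalize(self.fc(h), dim=-1)


def contrastive_loss(a: torch.Tensor, u: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """label = 1 for in-sync audio/ultrasound pairs, 0 for shifted pairs.

    Labels come from artificially shifting one modality against the other,
    not from human annotation -- this is what makes the training
    self-supervised.
    """
    d = F.pairwise_distance(a, u)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()


def estimate_offset(audio_emb: torch.Tensor, ultra_emb: torch.Tensor,
                    max_shift: int = 30) -> int:
    """Return the shift (in windows) minimising mean embedding distance.

    audio_emb, ultra_emb: (n_windows, embed_dim), one row per time window.
    """
    best_shift, best_dist = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, u = audio_emb[shift:], ultra_emb[:len(ultra_emb) - shift]
        else:
            a, u = audio_emb[:shift], ultra_emb[-shift:]
        if len(a) == 0:
            continue
        dist = F.pairwise_distance(a, u).mean().item()
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift
```

In this style of model, the two encoders are trained jointly with the contrastive loss on pairs sampled at true and deliberately offset alignments; at test time, the offset that minimises the embedding distance over an utterance is taken as the synchronisation correction.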
18 pages, 10 figures. Manuscript accepted at Speech Communication.
Database: OpenAIRE