Audio-Visual Multi-Channel Recognition of Overlapped Speech
Author: Helen Meng, Dong Yu, Rongzhi Gu, Xunying Liu, Meng Yu, Bo Wu, Lianwu Chen, Yong Xu, Shi-Xiong Zhang, Jianwei Yu, Dan Su
Year of publication: 2020
Subject: Beamforming; Reduction (complexity); Masking (art); Microphone array; Signal-to-noise ratio; Computer Science::Sound; Computer science; Speech recognition; Word error rate; Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing); Filter (signal processing); Interpolation
Source: INTERSPEECH
DOI: 10.21437/interspeech.2020-2346
Description: Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To address it, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter & sum, and mask-based MVDR beamforming approaches were developed. To reduce the error-cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function alone, or a multi-task criterion that interpolates it with a scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed by simulation or by replaying the Lip Reading Sentences 2 (LRS2) dataset, respectively. (Illustrative sketches of the mask-based MVDR beamformer and the Si-SNR training criterion follow the record below.)
Database: OpenAIRE
External link:
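
For orientation, the mask-based MVDR front-end named in the abstract is commonly written as below. This is the standard mask-based MVDR formulation (e.g., Souden's trace-normalized solution), not necessarily the exact variant used in the paper; the symbols y(t,f), M_s, M_n, Φ, and u are conventional notation introduced here, not taken from the record.

```latex
% Mask-based MVDR beamformer: a standard formulation, assuming
% y(t,f) is the multi-channel STFT of the mixture, M_s / M_n are
% estimated speech / noise TF masks, and u is a one-hot vector
% selecting the reference microphone.
\[
  \mathbf{\Phi}_{s}(f) = \frac{1}{T}\sum_{t=1}^{T} M_{s}(t,f)\,
      \mathbf{y}(t,f)\,\mathbf{y}^{\mathsf{H}}(t,f), \qquad
  \mathbf{\Phi}_{n}(f) = \frac{1}{T}\sum_{t=1}^{T} M_{n}(t,f)\,
      \mathbf{y}(t,f)\,\mathbf{y}^{\mathsf{H}}(t,f)
\]
\[
  \mathbf{w}(f) =
    \frac{\mathbf{\Phi}_{n}^{-1}(f)\,\mathbf{\Phi}_{s}(f)}
         {\operatorname{tr}\!\bigl(\mathbf{\Phi}_{n}^{-1}(f)\,
          \mathbf{\Phi}_{s}(f)\bigr)}\,\mathbf{u}, \qquad
  \hat{s}(t,f) = \mathbf{w}^{\mathsf{H}}(f)\,\mathbf{y}(t,f)
\]
```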
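
The joint fine-tuning criterion interpolates the CTC recognition loss with a Si-SNR separation cost. The following is a minimal NumPy sketch of Si-SNR and one plausible interpolation; the function names, the `weight` factor, and the additive form of the combination are illustrative assumptions, since the record does not spell out the exact interpolation used.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (Si-SNR) in dB.

    Both signals are mean-normalized, and the reference is scaled by the
    projection of the estimate onto it, so the metric is invariant to
    offsets and to rescaling of the estimate.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the "target" component.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) /
                           (np.dot(e_noise, e_noise) + eps))

def joint_loss(ctc_loss, est, ref, weight=0.1):
    # Hypothetical additive interpolation: minimizing -Si-SNR improves
    # separation quality while the CTC term drives recognition accuracy.
    # `weight` is an assumed tuning factor, not a value from the paper.
    return ctc_loss + weight * (-si_snr(est, ref))

# Tiny usage example with a synthetic reference and a noisy estimate.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)               # 1 s of "clean" speech at 16 kHz
est = ref + 0.1 * rng.standard_normal(16000)   # estimate with residual noise
print(f"Si-SNR: {si_snr(est, ref):.2f} dB")
print(f"joint loss (CTC=5.0): {joint_loss(5.0, est, ref):.2f}")
```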