Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Autor:	Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani
Rok vydání:	2022
Předmět:	FOS: Computer and information sciences Computer Science - Computation and Language Audio and Speech Processing (eess.AS) FOS: Electrical engineering electronic engineering information engineering Computation and Language (cs.CL) Electrical Engineering and Systems Science - Audio and Speech Processing
DOI:	10.48550/arxiv.2209.06058
Popis:	Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales shows that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including oracle LID in the input.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::48bd4fe6333011053c8e723968f9842b Zobrazit plný text záznamu