Toward Streaming ASR with Non-Autoregressive Insertion-Based Model
Authors: Tianzi Wang, Shinji Watanabe, Yuya Fujita, Motoi Omachi
Year of publication: 2021
Subject: Sequence; Artificial neural network; Computer science; Speech recognition; Latency (audio); Security token; ComputingMethodologies_ARTIFICIALINTELLIGENCE; Connectionism; Autoregressive model; Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Segmentation; Electrical Engineering and Systems Science - Audio and Speech Processing; Transformer (machine learning model)
Source: Interspeech 2021
DOI: 10.21437/interspeech.2021-1131
Description: Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is the segmentation of audio to deal with streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because the latency of the system can be reduced. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an $N$-length token sequence in fewer than $N$ iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to realize high-accuracy and low-RTF ASR. As the non-autoregressive ASR model, an insertion-based model is used. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that realizes audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datasets show that the proposed method achieves a reasonable trade-off between accuracy and RTF compared with baseline autoregressive Transformer and connectionist temporal classification models.
Database: OpenAIRE
External link:
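
The description above notes that an insertion-based non-autoregressive model can decode an $N$-length token sequence in fewer than $N$ iterations. The following is a minimal sketch of that general style of decoding, not the authors' implementation: the `predict_insertions` callable and the balanced-order `oracle` below are hypothetical stand-ins for a trained insertion model, used only to show why a balanced insertion order finishes in roughly log2(N) rounds.

```python
# Toy illustration of insertion-based non-autoregressive decoding.
# NOT the paper's implementation: `predict_insertions` stands in for a
# trained insertion model. Given the current partial hypothesis, it returns
# one prediction per slot between tokens: either a token to insert or None
# ("insert nothing"). With a balanced insertion order, an N-token sequence
# is produced in about log2(N) iterations instead of N autoregressive steps.

from typing import Callable, List, Optional

EOS_SLOT = None  # marker meaning "insert nothing into this slot"


def insertion_decode(
    predict_insertions: Callable[[List[str]], List[Optional[str]]],
    max_iters: int = 32,
) -> List[str]:
    """Iteratively insert tokens until every slot predicts 'no insertion'."""
    hyp: List[str] = []  # start from the empty hypothesis
    for _ in range(max_iters):
        # One prediction per slot: a hypothesis of length L has L + 1 slots.
        slot_preds = predict_insertions(hyp)
        assert len(slot_preds) == len(hyp) + 1
        if all(p is EOS_SLOT for p in slot_preds):
            break  # converged: nothing left to insert
        # Insert from right to left so earlier slot indices stay valid.
        for pos in range(len(slot_preds) - 1, -1, -1):
            tok = slot_preds[pos]
            if tok is not EOS_SLOT:
                hyp.insert(pos, tok)
    return hyp


if __name__ == "__main__":
    # A hypothetical "oracle" model that reconstructs a fixed target sentence
    # by always proposing the middle missing token of each gap (balanced order).
    target = "this is a toy insertion decoding example".split()

    def oracle(hyp: List[str]) -> List[Optional[str]]:
        # Toy setting: target tokens are unique, so an index lookup suffices.
        idx = [target.index(t) for t in hyp]
        bounds = [-1] + idx + [len(target)]
        preds: List[Optional[str]] = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            gap = list(range(lo + 1, hi))
            preds.append(target[gap[len(gap) // 2]] if gap else EOS_SLOT)
        return preds

    # Recovers the 7-token target in 3 insertion rounds (about log2(7)).
    print(insertion_decode(oracle))
```

In a real system the per-slot predictions would come from the insertion-based network's output distributions over the vocabulary; the sketch only demonstrates the iteration pattern that keeps the number of decoding rounds well below the output length, which is what makes the low-RTF claim in the description plausible.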