Learning When to Translate for Streaming Speech
| Author: | Qian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li |
| --- | --- |
| Year of publication: | 2021 |
| Subject: | FOS: Computer and information sciences; Computer Science - Computation and Language; Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Computation and Language (cs.CL); Electrical Engineering and Systems Science - Audio and Speech Processing |
| DOI: | 10.48550/arxiv.2109.07368 |
| Description: | How can we find the proper moments to generate partial translations given a streaming speech input? Existing approaches that wait and then translate for a fixed duration often break acoustic units in the speech, since the boundaries between acoustic units are not evenly spaced. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a typically long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model that accumulates acoustic information incrementally and detects proper speech unit boundaries for the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst. Comment: Accepted to the ACL 2022 main conference. 15 pages, 6 figures. (An illustrative sketch of the read/translate decision loop follows this record.) |
| Database: | OpenAIRE |
| External link: | |
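The description above outlines MoSST's core idea: accumulate acoustic frames incrementally and emit a partial translation only once a full acoustic unit boundary is detected, rather than translating after a fixed wait. Below is a minimal, self-contained Python sketch of that read/translate decision loop. All names here (`detect_unit_boundary`, `translate_prefix`, `streaming_translate`) and the energy-threshold heuristic are illustrative assumptions, not the MoSST implementation; the actual learned monotonic segmentation module and encoder-decoder are in the linked repository.

```python
# Illustrative sketch of a streaming "translate only at unit boundaries" policy.
# All functions are hypothetical placeholders, NOT the MoSST code.

from typing import List


def detect_unit_boundary(frames: List[float], threshold: float = 0.1) -> bool:
    """Toy boundary detector: declare a unit boundary when the latest frame's
    energy drops below a threshold (a stand-in for the learned monotonic
    segmentation module described in the abstract)."""
    return len(frames) > 0 and frames[-1] < threshold


def translate_prefix(unit_count: int) -> str:
    """Placeholder for decoding a partial translation from all audio seen so
    far; here it only reports how many acoustic units have been consumed."""
    return f"<partial translation after unit {unit_count}>"


def streaming_translate(stream: List[float]) -> List[str]:
    """Accumulate frames incrementally (READ) and emit a partial translation
    (WRITE) only when a complete acoustic unit boundary is detected, instead
    of translating after a fixed wait duration."""
    buffer: List[float] = []
    outputs: List[str] = []
    units = 0
    for frame in stream:
        buffer.append(frame)              # READ: accumulate acoustic information
        if detect_unit_boundary(buffer):  # boundary found -> safe to translate
            units += 1
            outputs.append(translate_prefix(units))  # WRITE: partial output
            buffer.clear()                # start accumulating the next unit
    return outputs


if __name__ == "__main__":
    # Synthetic "energy" stream: low values mark pauses between acoustic units.
    fake_energies = [0.9, 0.8, 0.7, 0.05, 0.85, 0.9, 0.04, 0.8, 0.02]
    for line in streaming_translate(fake_energies):
        print(line)
```

In the actual system the boundary decision would come from the learned segmentation module and the output from prefix decoding in the encoder-decoder; the threshold heuristic here only illustrates when the decoder is allowed to write, which is the trade-off between BLEU and latency that the paper evaluates.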