Investigating Self-supervised Pre-training for End-to-end Speech Translation
Author: | Fethi Bougares, Natalia A. Tomashenko, Ha Nguyen, Yannick Estève, Laurent Besacier |
---|---|
Contributors: | Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA), Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), Institut Universitaire de France (IUF), Ministère de l'Education nationale, de l’Enseignement supérieur et de la Recherche (M.E.N.E.S.R.), ANR-19-P3IA-0003, MIAI, MIAI @ Grenoble Alpes (2019) |
Language: | English |
Year of publication: | 2020 |
Subject: | Training set, Computer science, Speech recognition, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], End-to-end principle, Speech translation, Automatic speech, Orthography |
Source: | Interspeech 2020, Oct 2020, Shanghai (Virtual Conf), China |
Description: | International audience; Self-supervised learning from raw speech has been shown to improve automatic speech recognition (ASR). We investigate here its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained on unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly effective in low-resource settings and that fine-tuning CPC models on the AST training data further improves performance. Even in higher-resource settings, ensembling AST models trained with filter-bank and CPC representations leads to near state-of-the-art models without using any ASR pre-training. This might be particularly beneficial when one needs to develop a system that translates from speech in a language with poorly standardized orthography, or even from speech in an unwritten language. Index Terms: self-supervised learning from speech, automatic speech translation, end-to-end models, low resource settings. A minimal illustrative sketch of the CPC-as-feature-extractor setup follows this record. |
Database: | OpenAIRE |
External link: |
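The abstract describes using a CPC model pre-trained on unlabeled speech as a (frozen or fine-tuned) feature extractor feeding an end-to-end AST encoder. The sketch below is a hedged illustration of that setup, not the authors' implementation: the module names (`CPCStyleEncoder`, `TinyASTEncoder`), layer sizes, and the toy data are assumptions made only for demonstration.

```python
# Illustrative sketch (NOT the paper's code): a tiny CPC-style encoder used as a
# frozen feature extractor for a downstream speech-translation (AST) encoder.
# All module names, layer sizes, and data shapes here are assumptions.
import torch
import torch.nn as nn


class CPCStyleEncoder(nn.Module):
    """Convolutional waveform encoder + autoregressive GRU, in the spirit of CPC."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Strided 1-D convolutions downsample the raw waveform into frame vectors.
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, hidden_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Autoregressive context network summarizing past frames.
        self.context_net = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> context vectors: (batch, frames, hidden_dim)
        z = self.feature_encoder(waveform.unsqueeze(1)).transpose(1, 2)
        c, _ = self.context_net(z)
        return c  # used as speech "features" in place of filter-banks


class TinyASTEncoder(nn.Module):
    """Downstream AST encoder consuming either CPC features or filter-banks."""

    def __init__(self, input_dim: int, model_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(input_dim, model_dim)
        self.blstm = nn.LSTM(model_dim, model_dim // 2, num_layers=2,
                             batch_first=True, bidirectional=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.blstm(self.proj(feats))
        return out  # to be attended over by a translation decoder


if __name__ == "__main__":
    cpc = CPCStyleEncoder()
    cpc.eval()                     # freeze the pre-trained extractor ...
    for p in cpc.parameters():     # ... or leave requires_grad=True to
        p.requires_grad = False    #     fine-tune it on the AST training data

    ast_encoder = TinyASTEncoder(input_dim=256)

    dummy_wave = torch.randn(2, 16000)      # 2 utterances of 1 s at 16 kHz
    with torch.no_grad():
        feats = cpc(dummy_wave)             # (2, frames, 256)
    enc_out = ast_encoder(feats)
    print(feats.shape, enc_out.shape)
```

For the ensembling result mentioned in the abstract, a common (assumed) realization is to train one AST model on filter-bank features and one on CPC features, then average their output token distributions at decoding time; the paper reports that such combinations approach state-of-the-art without any ASR pre-training.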