Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Authors: Andrei Andrusenko, Ivan Medennikov, Aleksandr Laptev, Yuri Matveev, Anton Mitrofanov, Ivan Podluzhny
Language: English
Publication year: 2021
Subject:
Byte pair encoding
FOS: Computer and information sciences
Computer Science - Machine Learning
Vocabulary
Sound (cs.SD)
Computer science
Speech recognition
media_common.quotation_subject
Word error rate
TP1-1185
02 engineering and technology
Security token
Biochemistry
Computer Science - Sound
Article
Analytical Chemistry
Personalization
Machine Learning (cs.LG)
Audio and Speech Processing (eess.AS)
augmentation
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
Speech
Electrical and Electronic Engineering
Instrumentation
Dropout (neural networks)
end-to-end speech recognition
out-of-vocabulary
media_common
BABEL Turkish
Computer Science - Computation and Language
Chemical technology
020206 networking & telecommunications
Acoustics
Atomic and Molecular Physics, and Optics
BPE-dropout
Task (computing)
low-resource
Hybrid system
BABEL Georgian
Speech Perception
transformer
020201 artificial intelligence & image processing
Speech Recognition Software
Computation and Language (cs.CL)
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: Sensors
Volume 21
Issue 9
Sensors (Basel, Switzerland)
Sensors, Vol 21, Iss 9, p 3063 (2021)
ISSN: 1424-8220
DOI: 10.3390/s21093063
Description: With the rapid development of speech assistants, adapting server-oriented automatic speech recognition (ASR) solutions to run directly on the device has become crucial. Researchers and industry prefer end-to-end ASR systems for on-device speech recognition because they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Another challenge associated with speech assistants is personalization, which mainly consists of handling out-of-vocabulary (OOV) words. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, represented by the Babel Turkish and Babel Georgian tasks. To address these problems, we propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique. It tokenizes utterances non-deterministically to extend the tokens' contexts and to regularize their distribution, which helps the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative WER and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a competitive result of 22.2% character error rate (CER) and 38.9% word error rate (WER), which is close to the best published multilingual system.
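The non-deterministic tokenization mentioned in the description can be illustrated with a minimal Python sketch of BPE-dropout. This is not the authors' implementation: the function name `bpe_dropout_encode`, the toy merge table `MERGE_RANKS`, and the default dropout value are assumptions made for illustration. At each step, every applicable merge is skipped with probability `dropout`, and the highest-priority surviving merge is applied, so the same word can be segmented differently on each pass over the training data.

```python
import random

# Hypothetical toy merge table: pair -> priority (lower rank = learned earlier).
MERGE_RANKS = {
    ("l", "o"): 0,
    ("lo", "w"): 1,
    ("e", "r"): 2,
    ("low", "er"): 3,
}


def bpe_dropout_encode(word, merge_ranks, dropout=0.1, rng=None):
    """Segment `word` with BPE, randomly dropping each candidate merge.

    dropout=0.0 gives ordinary deterministic BPE; dropout=1.0 leaves the
    word split into single characters.
    """
    rng = rng or random.Random()
    tokens = list(word)
    while len(tokens) > 1:
        # Collect merges that are both known and not dropped at this step.
        candidates = [
            (merge_ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in merge_ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing survived the dropout, so stop merging
        _, i = min(candidates)  # highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        # Segmentations vary from call to call when dropout > 0.
        print(bpe_dropout_encode("lower", MERGE_RANKS, dropout=0.3, rng=rng))
```

Because the segmentation is sampled each time an utterance is processed, the augmentation happens during data loading and leaves the model and its output vocabulary unchanged, which is consistent with the description's claim of no additional computational cost.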
Comment: 16 pages, 7 figures
Database: OpenAIRE