Multi-Modal Data Augmentation for End-to-End ASR
Author: Matthew Wiesner, Shinji Watanabe, Shuoyang Ding, Adithya Renduchintala
Language: English
Year of publication: 2018
Subject: Text corpus; Speech recognition; Word error rate; End-to-end principle; Language model; Encoder; Character (computing); MMDA; Computer Science - Sound (cs.SD); Computer Science - Computation and Language (cs.CL); Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS)
Source: INTERSPEECH
Description: We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using *symbolic* input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline, both with and without an external language model. 5 pages, 1 figure; accepted at INTERSPEECH 2018.
Database: OpenAIRE
External link:
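
The description above outlines the dual-encoder design. Below is a minimal, self-contained PyTorch-style sketch of that idea, not the authors' implementation: the class, parameter, and dimension names are illustrative assumptions. It shows an acoustic encoder and a symbolic encoder routed into shared attention and decoder parameters, so that text-only (symbolic) batches can be mixed with transcribed speech during training.

```python
# Illustrative sketch of a dual-encoder attention model:
# one encoder per modality, shared attention/decoder/output layer.
import torch
import torch.nn as nn


class DualEncoderSketch(nn.Module):
    def __init__(self, feat_dim=80, sym_vocab=500, out_vocab=500, hidden=256):
        super().__init__()
        # Modality-specific encoders.
        self.acoustic_enc = nn.LSTM(feat_dim, hidden, batch_first=True,
                                    bidirectional=True)
        self.symbolic_emb = nn.Embedding(sym_vocab, hidden)
        self.symbolic_enc = nn.LSTM(hidden, hidden, batch_first=True,
                                    bidirectional=True)
        # Shared attention, decoder, and output projection.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.dec_emb = nn.Embedding(out_vocab, 2 * hidden)
        self.dec_rnn = nn.LSTM(2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, out_vocab)

    def forward(self, src, tgt, modality):
        # Route the input through the encoder matching its modality;
        # everything downstream is shared between acoustic and symbolic input.
        if modality == "acoustic":
            enc, _ = self.acoustic_enc(src)                      # (B, T, feat_dim)
        else:
            enc, _ = self.symbolic_enc(self.symbolic_emb(src))   # (B, T) symbol ids
        dec_in = self.dec_emb(tgt)                               # teacher-forced targets
        dec_out, _ = self.dec_rnn(dec_in)
        ctx, _ = self.attn(dec_out, enc, enc)                    # shared attention
        return self.out(dec_out + ctx)                           # logits over characters


# Usage: alternate acoustic and text-only batches against the same decoder.
model = DualEncoderSketch()
speech = torch.randn(2, 120, 80)                # acoustic feature frames
text = torch.randint(0, 500, (2, 30))           # symbolic input ids
targets = torch.randint(0, 500, (2, 25))        # character targets
logits_speech = model(speech, targets, modality="acoustic")
logits_text = model(text, targets, modality="symbolic")
```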