Autor: |
Singh, Kritika, Manohar, Vimal, Xiao, Alex, Edunov, Sergey, Girshick, Ross, Liptchinsky, Vitaliy, Fuegen, Christian, Saraf, Yatharth, Zweig, Geoffrey, Mohamed, Abdelrahman |
Rok vydání: |
2020 |
Předmět: |
|
Druh dokumentu: |
Working Paper |
Popis: |
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline. |
Databáze: |
arXiv |
Externí odkaz: |
|