Semi-supervised acoustic model training for five-lingual code-switched ASR
Author: Thomas Niesler, Ewald van der Westhuizen, Astik Biswas, Febe de Wet, Emre Yilmaz
Year of publication: 2019
Subject: FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Computer Science - Sound (cs.SD); Computer Science - Computation and Language (cs.CL); Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS); acoustic model; time delay neural network; code-switching; semi-supervised training; language model; languages of Africa; automatic transcription; natural language processing; artificial intelligence
Source: INTERSPEECH
DOI: 10.48550/arxiv.1906.08647
Description: This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For both approaches we assess the effectiveness of semi-supervised training in increasing the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline (see the illustrative pseudo-labelling sketch after this record).
Comment: Accepted for publication at Interspeech 2019
Database: OpenAIRE
External link:
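The description above outlines a semi-supervised training pipeline: an existing recogniser transcribes roughly 11 hours of untranscribed speech and the automatic transcriptions are added to the sparse manually transcribed training set before retraining. The following Python fragment is a minimal sketch of that pseudo-labelling loop in the abstract's own terms; the `Utterance` class, the `decode` and `train` callables, and the confidence threshold are illustrative assumptions, not the authors' actual (Kaldi-based TDNN-F) recipe.

```python
"""Minimal sketch of a semi-supervised (pseudo-labelling) ASR training loop.
All names are illustrative assumptions; the paper's actual pipeline and its
confidence-selection details are not reproduced here."""

from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Utterance:
    audio_path: str          # path to an (initially untranscribed) audio segment
    transcript: str = ""     # hypothesis produced by the seed recogniser
    confidence: float = 0.0  # per-utterance decoding confidence


def pseudo_label(
    untranscribed: Iterable[Utterance],
    decode: Callable[[str], Tuple[str, float]],  # seed ASR: audio -> (hypothesis, confidence)
    min_confidence: float = 0.7,                 # assumed threshold, not taken from the paper
) -> List[Utterance]:
    """Decode untranscribed audio with the seed system and keep confident hypotheses."""
    selected: List[Utterance] = []
    for utt in untranscribed:
        hyp, conf = decode(utt.audio_path)
        if conf >= min_confidence:
            selected.append(Utterance(utt.audio_path, hyp, conf))
    return selected


def semi_supervised_round(
    manual_data: List[Utterance],
    untranscribed: List[Utterance],
    decode: Callable[[str], Tuple[str, float]],
    train: Callable[[List[Utterance]], None],  # e.g. retrains a bilingual or five-lingual AM
) -> None:
    """One round: augment the sparse manual set with pseudo-labels, then retrain."""
    augmented = manual_data + pseudo_label(untranscribed, decode)
    train(augmented)
```

In the five-lingual configuration described above, the same seed model can transcribe code-switched segments regardless of the language pair involved, which is why the abstract calls it an attractive automatic transcription component for this kind of loop.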