Semi-supervised acoustic model training for five-lingual code-switched ASR

Authors: Thomas Niesler, Ewald van der Westhuizen, Astik Biswas, Febe de Wet, Emre Yilmaz
Year of publication: 2019
Subject:
FOS: Computer and information sciences
FOS: Electrical engineering, electronic engineering, information engineering
Computer Science - Sound (cs.SD)
Computer Science - Computation and Language (cs.CL)
Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS)
02 engineering and technology
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
05 social sciences
0501 psychology and cognitive sciences
050101 languages & linguistics
Computer science
Artificial intelligence
Natural language processing
Acoustic model
Language model
Time delay neural network
Code-switching
Transcription
Languages of Africa
Source: INTERSPEECH
DOI: 10.48550/arxiv.1906.08647
Description: This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.
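The semi-supervised pipeline described in the abstract amounts to a pseudo-labelling loop: a seed acoustic model trained on the manually transcribed data automatically transcribes the roughly 11 hours of untranscribed speech, and the automatic transcripts are then pooled with the manual ones for a final retraining pass. The Python sketch below is only a minimal illustration of that loop under stated assumptions; the `train` and `decode` callables stand in for the actual acoustic-model training and decoding recipes, which are not part of this record.

```python
from typing import Callable, Iterable, List, Tuple

def semi_supervised_round(
    train: Callable[[List[Tuple[str, str]]], object],   # hypothetical: (audio, transcript) pairs -> acoustic model
    decode: Callable[[object, str], str],                # hypothetical: model + audio path -> automatic transcript
    labelled: List[Tuple[str, str]],                     # manually transcribed (audio path, transcript) pairs
    untranscribed: Iterable[str],                        # audio paths without transcripts
) -> object:
    """One round of semi-supervised (pseudo-label) training:
    1. train a seed model on the manually transcribed data,
    2. use it to transcribe the untranscribed audio,
    3. retrain on the combined manual + automatic transcripts."""
    seed_model = train(labelled)
    pseudo_labelled = [(wav, decode(seed_model, wav)) for wav in untranscribed]
    return train(labelled + pseudo_labelled)
```

In the bilingual setting this loop would be run once per language pair, whereas the five-lingual model can transcribe all of the untranscribed audio in a single pass, which is the flexibility the abstract highlights.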
Comment: Accepted for publication at Interspeech 2019
Database: OpenAIRE