Semi-supervised acoustic model training for five-lingual code-switched ASR

Authors: Thomas Niesler, Ewald van der Westhuizen, Astik Biswas, Febe de Wet, Emre Yilmaz
Year of publication: 2019
Subject:
FOS: Computer and information sciences
FOS: Electrical engineering, electronic engineering, information engineering
Computer Science - Sound (cs.SD)
Computer Science - Computation and Language (cs.CL)
Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS)
02 engineering and technology
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
05 social sciences
0501 psychology and cognitive sciences
050101 languages & linguistics
Computer science
Artificial intelligence
Natural language processing
Acoustic model
Language model
Time delay neural network
Code-switching
Transcription
Languages of Africa
Source: INTERSPEECH
DOI: 10.48550/arxiv.1906.08647
Description: This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.
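The semi-supervised pipeline described in the abstract amounts to a pseudo-labelling loop: a seed acoustic model trained on the manually transcribed data automatically transcribes the roughly 11 hours of untranscribed speech, and the automatic transcripts are then pooled with the manual ones for a final retraining pass. The Python sketch below is only a minimal illustration of that loop under stated assumptions; the `train` and `decode` callables stand in for the actual acoustic-model training and decoding recipes, which are not part of this record.

```python
from typing import Callable, Iterable, List, Tuple

def semi_supervised_round(
    train: Callable[[List[Tuple[str, str]]], object],   # hypothetical: (audio, transcript) pairs -> acoustic model
    decode: Callable[[object, str], str],                # hypothetical: model + audio path -> automatic transcript
    labelled: List[Tuple[str, str]],                     # manually transcribed (audio path, transcript) pairs
    untranscribed: Iterable[str],                        # audio paths without transcripts
) -> object:
    """One round of semi-supervised (pseudo-label) training:
    1. train a seed model on the manually transcribed data,
    2. use it to transcribe the untranscribed audio,
    3. retrain on the combined manual + automatic transcripts."""
    seed_model = train(labelled)
    pseudo_labelled = [(wav, decode(seed_model, wav)) for wav in untranscribed]
    return train(labelled + pseudo_labelled)
```

In the bilingual setting this loop would be run once per language pair, whereas the five-lingual model can transcribe all of the untranscribed audio in a single pass, which is the flexibility the abstract highlights.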
Comment: Accepted for publication at Interspeech 2019
Database: OpenAIRE