Developing Speech Resources from Parliamentary Data for South African English

Authors:	Thipe Isaiah Modipa, Febe de Wet, Jaco Badenhorst
Year of publication:	2016
Subject:	South African English; automatic alignment; speech data; business.industry; Computer science; Under-resourced languages; 020206 networking & telecommunications; 02 engineering and technology; Pronunciation; computer.software_genre; language.human_language; 0202 electrical engineering, electronic engineering, information engineering; language; General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Artificial intelligence; Transcription (software); business; computer; Natural language processing; General Environmental Science
Source:	SLTU
ISSN:	1877-0509
DOI:	10.1016/j.procs.2016.04.028
Description:	The official languages of South Africa can still be classified as under-resourced with respect to the speech resources required for technology development. Harvesting speech data from existing sources is one way to create additional resources. The aim of the study reported in this paper was to improve the harvesting and transcription accuracy of a corpus derived from parliamentary data. This aim was achieved by improving the text normalisation process and pronunciation modelling, as well as by iteratively training more accurate in-domain acoustic models. In this manner, more data could be harvested with higher confidence than when using baseline pronunciation dictionaries and out-of-domain speech data.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::69cd9fd2fe9cce4f857b42153eb85e3d