Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Author:	Joyce Nakatumba‐Nabende, Claire Babirye, Peter Nabende, Jeremy Francis Tusubira, Jonathan Mukiibi, Eric Peter Wairagala, Chodrine Mutebi, Tobius Saul Bateesa, Alvin Nahabwe, Hewitt Tusiime, Andrew Katumba
Language:	English
Year of publication:	2024
Subject:	automatic speech recognition; low‐resourced language; machine translation; speech dataset; text dataset; topic modeling; Electronic computers. Computer science; QA75.5-76.95
Source:	Applied AI Letters, Vol 5, Iss 2, Pp n/a-n/a (2024)
Document type:	article
ISSN:	2689-5595
DOI:	10.1002/ail2.92
Description:	ABSTRACT Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high‐quality natural language processing resources for low‐resourced African languages. Obtaining high‐quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore‐Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/20d1ff9f01e045f29424c88e5c172d7f