The Making of Lingala Corpus: An Under-resourced Language and the Internet

Author:	Bienvenu Sene-Mongaba
Year of publication:	2015
Subject:	Lingala; Unitex; Computer science; spelling standardization; computer.software_genre; NLP; Corpus linguistics; Selection (linguistics); corpus cleaning; General Materials Science; business.industry; African languages; Making-of; Spelling; language.human_language; Linguistics; Congo; language; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; The Internet; Artificial intelligence; business; computer; under-resourced language; Natural language processing
Source:	Procedia - Social and Behavioral Sciences. 198:442-450
ISSN:	1877-0428
DOI:	10.1016/j.sbspro.2015.07.464
Description:	Lingala is now the most widespread language in Congo. The Internet provides a great amount of data. This paper attempts to elucidate the issues involved in building a corpus for an under-resourced language where access to internet texts is difficult. To extract Lingala text from a mass of French text, it was necessary to go through a process of selection using a list of seed words. The raw corpus is composed of 6,080,426 tokens. I intervened on the data from internet sources by standardizing the spelling. This standardized corpus is stored separately from the raw corpus.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::896962984537fcbe52e03ae752a09a1c