Igbo Diacritic Restoration using Embedding Models

Authors:	Enemouh Chioma, Ignatius Ezeani, Mark Hepple, Ikechukwu E. Onyenwe
Year of publication:	2018
Subject:	Space (punctuation); Word embedding; Computer science; business.industry; First language; Lexical ambiguity; Igbo; 02 engineering and technology; Pronunciation; computer.software_genre; language.human_language; 03 medical and health sciences; 0302 clinical medicine; Diacritic; 030221 ophthalmology & optometry; 0202 electrical engineering, electronic engineering, information engineering; language; Embedding; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; Meaning (linguistics)
Source:	NAACL-HLT (Student Research Workshop)
DOI:	10.18653/v1/n18-4008
Description:	Igbo is a low-resource language spoken by approximately 30 million people worldwide. It is the native language of the Igbo people of south-eastern Nigeria. In the Igbo language, diacritics, both orthographic and tonal, play a major role in distinguishing the meaning and pronunciation of words. Omitting diacritics in texts often leads to lexical ambiguity. Diacritic restoration is a pre-processing task that restores missing diacritics to words from which they have been removed. In this work, we applied embedding models to the diacritic restoration task and compared their performance with that of n-gram models. Although word embedding models have been successfully applied to various NLP tasks, they have not, to our knowledge, been used for diacritic restoration. Two classes of word embedding models were used: those projected from the English embedding space, and those trained on an Igbo Bible corpus (≈ 1m). Our best result, 82.49%, improves on the baseline n-gram models.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::0303347ad885c2f1322c706d8744e02d https://doi.org/10.18653/v1/n18-4008