Arabic Diacritic Recovery Using a Feature-rich biLSTM Model

Authors:	Mohamed Eldesouki, Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish
Year of publication:	2021
Subject:	FOS: Computer and information sciences; Feature engineering; Computer Science - Machine Learning; General Computer Science; Computer science; Word error rate; 02 engineering and technology; computer.software_genre; Machine Learning (cs.LG); 030507 speech-language pathology & audiology; 03 medical and health sciences; 0202 electrical engineering, electronic engineering, information engineering; Computer Science - Computation and Language; business.industry; Variety (linguistics); language.human_language; Feature (linguistics); Arabic diacritics; Diacritic; language; Modern Standard Arabic; 020201 artificial intelligence & image processing; Artificial intelligence; 0305 other medical science; Classical Arabic; business; Computation and Language (cs.CL); computer; Natural language processing
Source:	ACM Transactions on Asian and Low-Resource Language Information Processing. 20:1-18
ISSN:	2375-4702 2375-4699
Description:	Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to pronounce words correctly. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is harder than recovering core-word diacritics because of inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::ce3f0b5f33e0066c0d0aa9acccd149da https://doi.org/10.1145/3434235
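
The three error rates named in the description can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that each rate is measured per word, that a word counts as wrong if either its core diacritics or its case ending is wrong, and (a simplification) that the case ending is the run of diacritics after the last base letter of the word. The function names and the toy words are hypothetical.

```python
# Arabic diacritic marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def strip_diacritics(text):
    """Remove diacritics, giving the undiacritized form as usually written."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_core_and_ending(word):
    """Split a diacritized word into (core, case_ending).

    Simplifying assumption: the case ending is the run of diacritics
    that follows the last base letter of the word.
    """
    i = len(word)
    while i > 0 and word[i - 1] in DIACRITICS:
        i -= 1
    return word[:i], word[i:]


def error_rates(reference, hypothesis):
    """Per-word CWER, CEER, and WER for two aligned lists of words."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("need two non-empty word lists of equal length")
    cw_errors = ce_errors = word_errors = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_core, ref_end = split_core_and_ending(ref)
        hyp_core, hyp_end = split_core_and_ending(hyp)
        cw_wrong = ref_core != hyp_core
        ce_wrong = ref_end != hyp_end
        cw_errors += cw_wrong
        ce_errors += ce_wrong
        word_errors += cw_wrong or ce_wrong  # wrong if either part is wrong
    n = len(reference)
    return {"CWER": cw_errors / n, "CEER": ce_errors / n, "WER": word_errors / n}


# Toy data (hypothetical), built from code points to keep the marks explicit.
FATHA, DAMMA, DAMMATAN = "\u064E", "\u064F", "\u064C"
KAF, TA, BA = "\u0643", "\u062A", "\u0628"
kataba = KAF + FATHA + TA + FATHA + BA + FATHA
kutubun = KAF + DAMMA + TA + DAMMA + BA + DAMMATAN

reference = [kataba, kutubun, kataba]
hypothesis = [
    kataba,                                    # fully correct
    KAF + FATHA + TA + DAMMA + BA + DAMMATAN,  # core-word error
    KAF + FATHA + TA + FATHA + BA + DAMMA,     # case-ending error
]
rates = error_rates(reference, hypothesis)
```

Under these assumptions the combined word error rate can never exceed CWER + CEER, since each wrong word has at least one of the two error types. The reported figures are consistent with that bound: 6.0% against 2.9% + 3.7% for MSA, and 4.3% against 2.2% + 2.5% for CA.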