Moroccan Arabizi-to-Arabic conversion using rule-based transliteration and weighted Levenshtein algorithm

Authors: Soufiane Hajbi, Omayma Amezian, Nawfal El Moukhi, Redouan Korchiyne, Younes Chihab
Language: English
Year of publication: 2024
Subject:
Source: Scientific African, Vol 23, Pp e02073 (2024)
Document type: article
ISSN: 2468-2276
DOI: 10.1016/j.sciaf.2024.e02073
Description: The rise of social media has contributed to the widespread use of Arabizi, a writing form used primarily in colloquial communication. For Natural Language Processing (NLP) tools, processing texts in this form remains challenging due to the lack of suitable language resources. In addition, there are no standardized rules for the transliteration mapping between Arabizi and Arabic, which results in variations across dialectal groups. To address these limitations in the context of Moroccan Darija (MD), this work proposes a method for converting Arabizi to Arabic at the word level. The method sequentially combines rule-based transliteration with a weighted Levenshtein algorithm. The contributions of this approach include: (i) building a large MD dataset of texts that reflect the characteristics of MD and the colloquial forms commonly written in Arabizi; (ii) generating transliteration rules tailored to MD; and (iii) adapting the edit costs within the weighted Levenshtein algorithm to improve conversion performance. The approach was evaluated on three datasets: two state-of-the-art Darija-Modern Standard Arabic (MSA) datasets and the MD dataset collected as part of this work. The proposed method achieved a Mean Reciprocal Rank (MRR) of 92.14% and an accuracy of 88.44%.
Database: Directory of Open Access Journals
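
The two-stage idea summarized in the description (rule-based transliteration followed by weighted Levenshtein matching) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rule table `RULES`, the toy lexicon, the cost values, and the function names `transliterate`, `weighted_levenshtein`, and `rank_candidates` are all hypothetical and chosen only for illustration. The sketch assumes that a rough transliteration is compared against a list of known Arabic-script words and that candidates are ranked by edit distance.

```python
# Hypothetical Arabizi-to-Arabic rules (illustrative only, not the paper's rule set).
RULES = {
    "ch": "ش", "kh": "خ", "gh": "غ",
    "3": "ع", "7": "ح", "9": "ق",
    "s": "س", "l": "ل", "m": "م", "b": "ب", "k": "ك",
    "t": "ت", "r": "ر", "d": "د", "n": "ن",
    "a": "ا", "i": "ي", "o": "و", "u": "و",
}


def transliterate(word):
    """Map an Arabizi word to a rough Arabic string, longest rule first."""
    out = []
    i = 0
    while i < len(word):
        for size in (2, 1):  # try digraphs such as "ch" before single characters
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            out.append(word[i])  # no rule matched: keep the character unchanged
            i += 1
    return "".join(out)


def weighted_levenshtein(source, target, sub_cost=None, del_cost=None, ins_cost=None):
    """Edit distance with per-character costs; any unlisted operation costs 1.0."""
    sub_cost = sub_cost or {}
    del_cost = del_cost or {}
    ins_cost = ins_cost or {}
    # prev[j] = cost of turning the processed prefix of source into target[:j]
    prev = [0.0]
    for ch in target:
        prev.append(prev[-1] + ins_cost.get(ch, 1.0))
    for cs in source:
        cur = [prev[0] + del_cost.get(cs, 1.0)]
        for j, ct in enumerate(target, 1):
            substitute = prev[j - 1] + (0.0 if cs == ct else sub_cost.get((cs, ct), 1.0))
            delete = prev[j] + del_cost.get(cs, 1.0)
            insert = cur[j - 1] + ins_cost.get(ct, 1.0)
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


def rank_candidates(arabizi_word, lexicon, **costs):
    """Return (word, distance) pairs from the lexicon, closest first."""
    rough = transliterate(arabizi_word.lower())
    scored = [(w, weighted_levenshtein(rough, w, **costs)) for w in lexicon]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    lexicon = ["كتاب", "سلام", "علاش"]  # toy lexicon, assumed for the example
    # Assumed cost: dropping a written vowel letter is cheap, since Arabizi
    # often spells out vowels that Arabic script leaves unwritten.
    cheap_vowel_deletion = {"ا": 0.3}
    for word in ("salam", "3lach"):
        print(word, rank_candidates(word, lexicon, del_cost=cheap_vowel_deletion)[:2])
```

Ranking the candidates, rather than returning a single best match, is what makes a rank-based metric such as MRR applicable. In this sketch the cost tables are set by hand; how the paper derives its adapted edit costs is not stated in the abstract.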