Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

Authors:	Yahya Aseri, Khalid Alreemy, Salem Alelyani, Mohamed Mohanna
Year of publication:	2022
Source:	Embedded Systems and Applications.
DOI:	10.5121/csit.2022.120628
Description:	Dialect identification is a prerequisite for learning the lexical and morphological knowledge of a language variety, which can benefit natural language processing (NLP) and potential downstream AI tasks. In this paper, we present the first work on sentence-level Modern Standard Arabic (MSA) and Saudi Dialect (SD) identification, in which we trained and tested three classifiers (Logistic Regression, Multinomial Naïve Bayes, and Support Vector Machine) on datasets collected from Saudi Twitter and automatically labeled as MSA or SD. The model for each configuration was built using two levels of language models, i.e., unigram and bigram, as feature sets for training the systems. The model reported high accuracy under 10-fold cross-validation, with an average of 98.98%. This model was then evaluated on another unseen, manually annotated dataset. The best performance among these classifiers was achieved by Multinomial Naïve Bayes, reporting 89%.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::68bddc2ed5a0b139d9f8534872f0f57d https://doi.org/10.5121/csit.2022.120628
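
The description above names Multinomial Naïve Bayes over unigram and bigram features, evaluated with k-fold cross-validation. A minimal pure-Python sketch of that general technique follows. It is not the authors' implementation: the names `ngram_features`, `MultinomialNB`, and `cross_validate`, the add-one smoothing, the whitespace tokenization, and the six toy sentences are all assumptions made for illustration, and the toy sentences are not drawn from the paper's datasets.

```python
import math
from collections import Counter, defaultdict


def ngram_features(sentence):
    """Return word unigrams followed by word bigrams (whitespace tokenization)."""
    tokens = sentence.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed, not from the paper)

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            feats = ngram_features(sentence)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        # Sorted order makes tie-breaking deterministic.
        for label in sorted(self.class_counts):
            score = math.log(self.class_counts[label] / total)
            denom = (sum(self.feature_counts[label].values())
                     + self.alpha * len(self.vocab))
            for feat in ngram_features(sentence):
                if feat in self.vocab:  # features unseen in training are skipped
                    count = self.feature_counts[label][feat]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_validate(sentences, labels, k):
    """Average accuracy over k folds; fold f holds out indices f, f+k, f+2k, ..."""
    n = len(sentences)
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [sentences[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = MultinomialNB().fit(train_x, train_y)
        correct = sum(model.predict(sentences[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Toy, illustrative sentences only (assumed examples, not the paper's data).
sentences = ["ماذا تريد", "إلى أين تذهب", "أريد أن أذهب",
             "وش تبي", "وين رايح", "ابغى اروح"]
labels = ["MSA", "MSA", "MSA", "SD", "SD", "SD"]

model = MultinomialNB().fit(sentences, labels)
prediction = model.predict("وش تبي الحين")
mean_accuracy = cross_validate(sentences, labels, k=3)
```

With so few sentences the cross-validation score carries no meaning; the call is included only to show how the folds are formed. The paper reports 10-fold cross-validation, which would correspond to `k=10` on a dataset of realistic size.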