Automatic Identification of Moroccan Colloquial Arabic
Author: | Karim Bouzoubaa, Si Lhoussaine Aouragh, Ridouane Tachicart, Hamid Jaafa |
---|---|
Year of publication: | 2018 |
Subject: | Stop words; Language identification; Computer science; business.industry; Arabic languages; computer.software_genre; Task (project management); Support vector machine; Identification (information); ComputingMethodologies_PATTERNRECOGNITION; Classifier (linguistics); Language model; Artificial intelligence; business; computer; Natural language processing |
Source: | Communications in Computer and Information Science, ISBN: 9783319734996, ICALP |
Description: | Language identification is an NLP task that aims to predict the language of a given text. For the Arabic dialects, many attempts have been made to address this topic. In this paper, we present our approach to building a language identification system that distinguishes between Moroccan Colloquial Arabic and Arabic languages using two different methods. The first is rule-based and relies on stop word frequency, while the second is statistically based and uses several machine learning classifiers. The results show that the statistical approach outperforms the rule-based approach. Furthermore, the Support Vector Machines classifier is more accurate than the other statistical classifiers. Our goal in this paper is to pave the way toward building advanced Moroccan dialect NLP tools such as a morphological analyzer and a machine translation system. |
Database: | OpenAIRE |
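
The rule-based method mentioned in the description relies on stop word frequency. The record does not give the authors' rules or word lists, so the following is only a minimal sketch of the general idea: compare how many tokens of a text fall into a Moroccan stop word list versus a Standard Arabic one. The two word lists, the function names (`tokenize`, `stop_word_ratio`, `identify`), and the tie-handling rule are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop word lists. These are assumptions for the sketch,
# not the lists used in the paper.
MOROCCAN_STOP_WORDS = {"ديال", "بزاف", "واش", "شنو", "ماشي", "هاد", "غادي", "فين", "علاش", "كيفاش"}
STANDARD_STOP_WORDS = {"هذا", "هذه", "الذي", "التي", "ليس", "ماذا", "لماذا", "كيف", "أين", "سوف"}


def tokenize(text):
    """Split text on runs of non-word characters and drop empty tokens."""
    return [token for token in re.split(r"\W+", text) if token]


def stop_word_ratio(tokens, stop_words):
    """Return the fraction of tokens that appear in the given stop word set."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in stop_words) / len(tokens)


def identify(text):
    """Label text as 'moroccan', 'standard', or 'unknown' when the ratios tie."""
    tokens = tokenize(text)
    moroccan = stop_word_ratio(tokens, MOROCCAN_STOP_WORDS)
    standard = stop_word_ratio(tokens, STANDARD_STOP_WORDS)
    if moroccan == standard:
        return "unknown"  # hypothetical tie rule, chosen for this sketch
    return "moroccan" if moroccan > standard else "standard"
```

Calling `identify` on a sentence returns one of the three labels. A practical system would likely need much larger stop word lists and some normalization of spelling variants, which this sketch leaves out.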