Tokenization of Tunisian Arabic: a comparison between three Machine Learning models

Authors: Asma Mekki, Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith
Year of publication: 2023
Subject:
Source: ACM Transactions on Asian and Low-Resource Language Information Processing.
ISSN: 2375-4702, 2375-4699
DOI: 10.1145/3599234
Description: Tokenization is the process of segmenting a piece of text into smaller units called tokens. Since Arabic is an agglutinative language by nature, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications such as morphological analysis, parsing, machine translation, and information extraction. In this paper, we investigate the word tokenization task together with a rewriting process that rewrites the orthography of the stem. For this task, we use Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study to address word tokenization for Tunisian Arabic. We therefore start by collecting and preparing various TA corpora from different sources. Then, we present a comparison of three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best proposed model, based on CRF, achieved an F-measure of 88.9%.
Database: OpenAIRE