Author: |
Naili, Marwa; Chaibi, Anja Habacha; Ghezala, Henda Hajjami Ben |
Source: |
ACM Transactions on Asian and Low-Resource Language Information Processing; January 2018, Vol. 17 Issue: 2 p1-25, 25p |
Abstract: |
Topic segmentation is one of the pillars of Natural Language Processing, yet there is a notable research gap in this field where the Arabic language is concerned. The purpose of this article is to improve Arabic Topic Segmentation (ATS) by investigating two segmenters: ArabC99 and ArabTextTiling. The study is carried out on two independent levels, the pre-processing level and the segmentation level, which represent the basic steps of topic segmentation. On the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is more appropriate for the pre-processing step. Based on this conclusion, we proceed to the second level by proposing two Arabic segmenters called ArabC99-LS-LSA and ArabTextTiling-LS-LSA. Both use external semantic knowledge based on Latent Semantic Analysis (LSA). The evaluation results show that LSA provides improvements in this field. Hence, the main outcome of this article is a multilevel improvement of ATS based on Light10 and LSA. |
Database: |
Supplemental Index |