Index Term Selection Heuristics for Arabic Text Retrieval
Author: | Yaser A. Al-Lahham |
---|---|
Year of publication: | 2021 |
Subject: | Root (linguistics); Multidisciplinary; Basis (linear algebra); business.industry; Computer science; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; 010102 general mathematics; computer.software_genre; 01 natural sciences; Prefix; Index (publishing); Simple (abstract algebra); Index term; Artificial intelligence; 0101 mathematics; Heuristics; business; computer; Selection (genetic algorithm); Natural language processing |
Source: | Arabian Journal for Science and Engineering. 46:3345-3355 |
ISSN: | 2191-4281 2193-567X |
DOI: | 10.1007/s13369-020-05022-3 |
Description: | Index term selection for Arabic is challenging because of the language's complex morphology. It is also a significant factor in the efficiency of any information retrieval system. Many index term selection methods have been proposed in the literature; most are based on root extraction and stemming, while others apply complex linguistic rules and machine learning tools. This paper proposes a simple heuristic method that selects a representative subset of terms to form the index. As a basis, the heuristics select index terms from Arabic words that carry the prefix ‘AL’ (definite words). The method then adds further words according to any of the following heuristics: words preceding or succeeding definite terms, words that follow certain linking words, words that follow prepositions in semi-sentences, and words that represent named entities. The heuristics were tested on the TREC-2001/2002 Arabic test collection. The results show that the method is effective: it outperforms indexing all terms stemmed by two well-known stemmers. For example, selecting definite words and named-entity words outperforms selecting all terms stemmed by the LIGHT10 stemmer by 8.4% in mean average precision while reducing the index size by 27.8%. |
Database: | OpenAIRE |
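
The heuristics named in the description can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the whitespace tokenisation, the minimum-length check, and the short preposition list are all assumptions made for illustration. The named-entity and linking-word heuristics are omitted because the record does not say how they are detected, and a plain prefix test will also match words in which the letters of ‘AL’ belong to the word itself rather than to the definite article.

```python
# Hypothetical sketch of the definite-word heuristics; not the paper's code.
AL_PREFIX = "ال"  # the Arabic definite article "AL"

# Small illustrative preposition list (an assumption; the paper's list is not given here).
PREPOSITIONS = {"في", "من", "على", "إلى", "عن"}


def is_definite(token):
    """Treat a token as definite if it starts with AL and has at least two more letters."""
    return token.startswith(AL_PREFIX) and len(token) > len(AL_PREFIX) + 1


def select_index_terms(tokens, use_neighbors=False, use_prepositions=False):
    """Return the selected tokens, deduplicated, in order of first appearance."""
    picked = set()  # positions of selected tokens
    for i, tok in enumerate(tokens):
        if is_definite(tok):
            picked.add(i)
            if use_neighbors:
                # Also take the word before and the word after a definite word.
                if i > 0:
                    picked.add(i - 1)
                if i + 1 < len(tokens):
                    picked.add(i + 1)
        if use_prepositions and tok in PREPOSITIONS and i + 1 < len(tokens):
            # Take the word that follows a preposition.
            picked.add(i + 1)

    terms = []
    for i in sorted(picked):
        tok = tokens[i]
        # Prepositions themselves are never indexed in this sketch.
        if tok not in PREPOSITIONS and tok not in terms:
            terms.append(tok)
    return terms


# Example sentence: "The student went to a new school."
tokens = "ذهب الطالب إلى مدرسة جديدة".split()
definite_only = select_index_terms(tokens)
with_prepositions = select_index_terms(tokens, use_prepositions=True)
with_neighbors = select_index_terms(tokens, use_neighbors=True)
```

In this sketch each heuristic only adds positions to a shared set, so heuristics can be switched on independently, in the spirit of the description's "any of the following heuristics". A real system would still apply stemming or normalisation to the selected terms before indexing.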