Comparison of different lemmatization approaches for information retrieval on Turkish text collection
Author: | Adil Alpkocak, Okan Ozturkmenoglu |
---|---|
Language: | English |
Year of publication: | 2012 |
Subject: | Finite-state machine; Information retrieval; Parsing; Computer science; Turkish; Lemmatisation; Data structure; Artificial intelligence; Word (computer architecture); Natural language; Natural language processing; Lemma (morphology) |
Description: | In this paper, we compare the performance of different lemmatization approaches for information retrieval (IR) over a Turkish text collection. A lemma is the "dictionary form" of a word, and lemmatization is the process of determining the lemma of a given word so that its different inflected forms can be analyzed as a single item. We compared three lemmatizers and one fixed-length truncation approach. The first is based on a morphological analyzer for Turkish that uses finite-state language processing technology; the second is the Dictionary-based Turkish Lemmatizer (DTL), which uses a radix-trie data structure; the third is a simple dictionary-based top-down parser; and the last truncates words at a fixed length. We assessed the lemmatizers on the Bilkent University Milliyet collection, which contains more than 400K documents, by running experiments in an IR system and comparing the results with well-known IR evaluation metrics. The results show that lemmatization improves IR performance. The best results were achieved with DTL, the radix-trie-based lemmatizer, which also used the smallest number of terms in the IR system. © 2012 IEEE. |
Database: | OpenAIRE |
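
To make two of the approaches named in the description more concrete, here is a minimal Python sketch of fixed-length truncation and of a dictionary lookup over a prefix tree. It is an illustration under stated assumptions, not the authors' DTL implementation: the record does not give the truncation length, so the default of 5 is hypothetical; the tiny lemma list is made up for the example; and the tree is a plain character trie rather than a compressed radix trie. All names (`truncate`, `PrefixDictionary`, `longest_lemma_prefix`) are invented for this sketch.

```python
from typing import Dict, Iterable, Optional


def truncate(word: str, length: int = 5) -> str:
    """Fixed-length truncation: keep at most the first `length` characters.

    The default length of 5 is an assumption made for this sketch.
    """
    return word[:length]


class TrieNode:
    """One node of a plain (uncompressed) character trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_lemma = False  # True if the path from the root spells a lemma


class PrefixDictionary:
    """Hypothetical dictionary of lemmas stored in a character trie."""

    def __init__(self, lemmas: Iterable[str]) -> None:
        self.root = TrieNode()
        for lemma in lemmas:
            node = self.root
            for ch in lemma:
                node = node.children.setdefault(ch, TrieNode())
            node.is_lemma = True

    def longest_lemma_prefix(self, word: str) -> Optional[str]:
        """Return the longest dictionary lemma that is a prefix of `word`.

        Returns None when no lemma in the dictionary is a prefix of `word`.
        """
        node = self.root
        best = None
        for i, ch in enumerate(word):
            node = node.children.get(ch)
            if node is None:
                break  # no dictionary entry continues along this path
            if node.is_lemma:
                best = word[: i + 1]
        return best


# Made-up sample dictionary: "kitap" (book), "ev" (house), "göz" (eye).
dictionary = PrefixDictionary(["kitap", "ev", "göz"])
print(dictionary.longest_lemma_prefix("kitaplar"))  # plural of "kitap"
print(dictionary.longest_lemma_prefix("evlerde"))   # "in the houses"
print(truncate("kitaplardan"))
```

Longest-prefix matching of this kind only works when the surface form begins with the lemma unchanged. A form such as "kitabı", where the final consonant of "kitap" is voiced, finds no match in this sketch; handling such alternations needs morphological knowledge that the sketch leaves out.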