A Hybrid Statistical Approach to Stemming in Turkish: An Agglutinative Language

Authors: Bahar Karaoglan, Tarik Kisla
Contributors: Ege Üniversitesi
Publication year: 2016
Subject:
Source: Volume: 17, Issue: 2, pp. 401-412
Anadolu University Journal of Science and Technology A-Applied Sciences and Engineering
ISSN: 2146-0205
1302-3160
DOI: 10.18038/btda.31812
Description: Stemming is a complex and important problem for agglutinative languages such as Turkish, where a theoretically infinite number of surface forms can be derived from a single lexeme. Both analytical and statistical approaches have been tried for stemming Turkish words. Two main problems arise with these approaches: reliance on a dictionary, which imposes a closed-vocabulary assumption, and disambiguation of the actual stem among numerous candidates. Here, we present a method that exploits the simple fact that nouns and verbs have different suffix patterns, and that uses statistical methods to strip off the suffixes. The part of speech (PoS) is determined from the suffix pattern, which in turn enables the decision on the stem boundary. The presented stemming technique, which does not employ a regular dictionary, is thus a remedy for the disambiguation problem. On the gold-standard PoS-tagged METU-Sabancı Turkish Treebank, the method achieves a performance rate of 93.83%.
Database: OpenAIRE
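
Illustration: the description outlines a pipeline in which suffixes are stripped, the suffix pattern indicates the part of speech, and the part of speech fixes the stem boundary. The Python sketch below is a toy rendering of that idea only, not the authors' method. The suffix inventories are tiny hand-picked subsets, the "longer stripped suffix chain wins" rule is a stand-in for the statistical scoring that the record does not specify, and the names `strip_suffixes` and `guess_stem` are hypothetical.

```python
# Assumption: tiny illustrative suffix subsets, NOT the inventories or
# statistics used in the paper.
NOUN_SUFFIXES = ("lar", "ler", "dan", "den", "da", "de", "ta", "te")
VERB_SUFFIXES = ("iyor", "ıyor", "uyor", "üyor", "acak", "ecek",
                 "di", "dı", "du", "dü", "um", "üm", "im", "ım")


def strip_suffixes(word, suffixes, min_stem=2):
    """Greedily strip known suffixes from the right end of `word`.

    Returns (stem, stripped), where `stripped` lists the removed suffixes
    in left-to-right order. At least `min_stem` characters are kept.
    """
    ordered = sorted(suffixes, key=len, reverse=True)  # try longest first
    stem, stripped = word, []
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[:-len(suffix)]
                stripped.insert(0, suffix)
                changed = True
                break
    return stem, stripped


def guess_stem(word):
    """Pick a PoS from the suffix pattern, then return (stem, pos).

    Stand-in decision rule (an assumption): the suffix class that accounts
    for more characters wins; ties go to the noun analysis.
    """
    noun_stem, _ = strip_suffixes(word, NOUN_SUFFIXES)
    verb_stem, _ = strip_suffixes(word, VERB_SUFFIXES)
    noun_len = len(word) - len(noun_stem)
    verb_len = len(word) - len(verb_stem)
    if noun_len == 0 and verb_len == 0:
        return word, "unknown"
    if verb_len > noun_len:
        return verb_stem, "verb"
    return noun_stem, "noun"


if __name__ == "__main__":
    for w in ("evlerde", "kitaplardan", "geliyorum"):
        print(w, guess_stem(w))
```

No dictionary is consulted at any point, which mirrors the open-vocabulary motivation stated in the description. A realistic version would need vowel-harmony-aware suffix classes and corpus-derived scores in place of the length heuristic.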