A Hybrid Statistical Approach to Stemming in Turkish: An Agglutinative Language
Authors: | Bahar Karaoglan, Tarik Kisla |
---|---|
Contributors: | Ege Üniversitesi |
Year of publication: | 2016 |
Subject: | Agglutinative language; Vocabulary; Lexeme; Turkish; Computer science; media_common.quotation_subject; Speech recognition; Mühendislik; Stemming; Natural Language Processing; Turkish; Agglutinative Language; 02 engineering and technology; computer.software_genre; Engineering; Noun; Stripping (linguistics); 0202 electrical engineering, electronic engineering, information engineering; Ortak Disiplinler; media_common; Infinite number; business.industry; 05 social sciences; 020207 software engineering; General Medicine; language.human_language; language; Artificial intelligence; 0509 other social sciences; Suffix; 050904 information & library sciences; business; computer; Natural language processing |
Source: | Anadolu University Journal of Science and Technology A-Applied Sciences and Engineering, Volume 17, Issue 2, pp. 401-412 |
ISSN: | 2146-0205 1302-3160 |
DOI: | 10.18038/btda.31812 |
Description: | Finding the stem of a word is a difficult and important problem for agglutinative languages such as Turkish, where a theoretically infinite number of surface forms can be derived from a single lexeme. Both analytical and statistical approaches have been tried for stemming Turkish words. Two main problems with these approaches are their reliance on a dictionary, which enforces a closed-vocabulary assumption, and the need to disambiguate the actual stem among numerous candidates. Here, we present a method that exploits the simple fact that nouns and verbs have different suffix patterns, and that uses statistical methods to strip off the suffixes. The part of speech (PoS) is determined from the suffix pattern, which in turn determines the stem boundary. The presented stemming technique, which does not employ a regular dictionary, is thus a remedy for the disambiguation problem. On the gold-standard, PoS-tagged METU-Sabancı Turkish Treebank, the method achieves a performance rate of 93.83%. |
Database: | OpenAIRE |
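
The abstract's core idea (the suffix pattern suggests the part of speech, which in turn fixes the stem boundary) can be illustrated with a toy sketch. This is not the authors' method: the record does not describe their statistical model, so the suffix inventories `NOUN_SUFFIXES` and `VERB_SUFFIXES`, the greedy stripping, and the "most characters explained" decision rule are all assumptions made here for illustration only.

```python
# Hypothetical, deliberately tiny suffix inventories (illustration only;
# real Turkish morphology is far richer and obeys vowel harmony rules).
NOUN_SUFFIXES = ("lar", "ler", "dan", "den", "da", "de", "ın", "in", "ı", "i")
VERB_SUFFIXES = ("acak", "ecek", "yor", "lar", "ler", "mak", "mek",
                 "dı", "di", "du", "dü")


def strip_suffixes(word, suffixes, min_stem=2):
    """Greedily strip known suffixes from the right, longest first.

    Returns the remaining stem and the list of stripped suffixes.
    A suffix is only removed if at least `min_stem` characters remain.
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    stripped = []
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                stripped.append(suffix)
                changed = True
                break
    return word, stripped


def stem(word):
    """Guess (stem, PoS) from whichever suffix pattern explains more characters.

    The coverage comparison is an assumed stand-in for the paper's
    statistical decision; ties default to "noun".
    """
    noun_stem, noun_suffixes = strip_suffixes(word, NOUN_SUFFIXES)
    verb_stem, verb_suffixes = strip_suffixes(word, VERB_SUFFIXES)
    noun_coverage = sum(len(s) for s in noun_suffixes)
    verb_coverage = sum(len(s) for s in verb_suffixes)
    if verb_coverage > noun_coverage:
        return verb_stem, "verb"
    return noun_stem, "noun"


print(stem("evlerde"))     # ev + ler + de  ("in the houses")
print(stem("gelecekler"))  # gel + ecek + ler ("they will come")
```

Note that no dictionary lookup is involved: the stem boundary falls out of the suffix pattern alone, which is the property the abstract emphasizes. A realistic system would replace the hand-written lists and the coverage rule with statistics learned from a corpus.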