An open, extendible, and fast Turkish morphological analyzer
Author: | Begüm Avar, Olcay Taner Yildiz, Gokhan Ercan |
---|---|
Contributors: | Işık Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü; Işık University, Faculty of Engineering, Department of Computer Engineering; Yıldız, Olcay Taner; Ercan, Gökhan |
Language: | English |
Year of publication: | 2019 |
Subject: | Rule engine; 050101 languages & linguistics; Spectrum analyzer; Java language; Trie data structures; Data structures; Proper nouns; XML languages; Computer science; Transducers; Text processing; Computational linguistics; 02 engineering and technology; Speech recognition; Lexicon; Turkish language; Trie; 0202 electrical engineering, electronic engineering, information engineering; 0501 psychology and cognitive sciences; Engines; Cache algorithms; Finite-state machine; 05 social sciences; Deep learning; Data structure; Finite state transducers; Natural language processing systems; Semantics; Morphological analyzer; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; Artificial intelligence; XML; Natural language processing |
Source: | RANLP |
Description: | In this paper, we present a two-level morphological analyzer for Turkish. The analyzer consists of five main components: a finite state transducer, a rule engine for suffixation, a lexicon, a trie data structure, and an LRU cache. We use Java to implement the finite state machine logic and the rule engine, and XML to describe the finite state transducer rules of Turkish, which makes the analyzer both easily extendible and easily applicable to other languages. With a comprehensive lexicon of 54,000 bare forms, including 19,000 proper nouns, our morphological analyzer is one of the most reliable analyzers produced so far. The analyzer is compared with Turkish morphological analyzers in the literature. By using an LRU cache and a trie data structure, the system can analyze 100,000 words per second, which enables users to analyze huge corpora in a few hours. |
Database: | OpenAIRE |
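
The description names a trie over the lexicon and an LRU cache as the two components behind the analyzer's speed. The record does not show how either is implemented, so the following is only a minimal Java sketch of the two data structures under stated assumptions: the names `LexiconSketch`, `Trie`, `LruCache`, and `longestPrefix` are hypothetical, the LRU cache is built on the standard `LinkedHashMap` access-order mode, and longest-prefix matching stands in for root lookup without modelling the stem alternations that Turkish suffixation can cause.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these class or method names come from the paper.
class LexiconSketch {

    /** Trie node: one child per character, plus a flag marking the end of a bare form. */
    static final class TrieNode {
        final Map<Character, TrieNode> children = new HashMap<>();
        boolean isWord;
    }

    /** Minimal trie holding the bare forms of a lexicon. */
    static final class Trie {
        private final TrieNode root = new TrieNode();

        void add(String word) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            node.isWord = true;
        }

        /** Returns the longest stored bare form that is a prefix of surface, or "" if none. */
        String longestPrefix(String surface) {
            TrieNode node = root;
            int best = 0;
            for (int i = 0; i < surface.length(); i++) {
                node = node.children.get(surface.charAt(i));
                if (node == null) {
                    break; // no stored form continues with this character
                }
                if (node.isWord) {
                    best = i + 1; // remember the longest complete form seen so far
                }
            }
            return surface.substring(0, best);
        }
    }

    /** LRU cache: LinkedHashMap in access order, evicting the least recently used entry. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // third argument true = iterate in access order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // drop the eldest entry once over capacity
        }
    }

    public static void main(String[] args) {
        Trie lexicon = new Trie();
        lexicon.add("ev");     // "house"
        lexicon.add("kitap");  // "book"

        // Cache candidate roots per surface form so repeated words skip the trie walk.
        LruCache<String, String> cache = new LruCache<>(2);
        for (String surface : new String[] {"evlerde", "kitaplar", "evlerde"}) {
            String root = cache.computeIfAbsent(surface, lexicon::longestPrefix);
            System.out.println(surface + " -> " + root);
        }
    }
}
```

In this sketch the cache is keyed on the full surface form, so a word that recurs in a corpus is resolved with a single hash lookup; a real analyzer would presumably cache complete morphological analyses rather than a bare prefix.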