An open, extendible, and fast Turkish morphological analyzer

Authors: Begüm Avar, Olcay Taner Yıldız, Gökhan Ercan
Contributors: Işık Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü; Işık University, Faculty of Engineering, Department of Computer Engineering; Yıldız, Olcay Taner; Ercan, Gökhan
Language: English
Year of publication: 2019
Subject:
Rule engine
050101 languages & linguistics
Spectrum analyzer
Java language
Trie data structures
Data structures
Proper nouns
XML languages
computer.internet_protocol
Computer science
Transducers
Text processing
Computational linguistics
02 engineering and technology
Speech recognition
computer.software_genre
Lexicon
Turkish language
Trie
0202 electrical engineering, electronic engineering, information engineering
0501 psychology and cognitive sciences
Engines
Cache algorithms
Finite-state machine
business.industry
05 social sciences
Deep learning
Data structure
Finite state transducers
Natural language processing systems
Semantics
Morphological analyzer
ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
020201 artificial intelligence & image processing
Artificial intelligence
business
computer
XML
Natural language processing
Source: RANLP
Description: In this paper, we present a two-level morphological analyzer for Turkish. The analyzer consists of five main components: a finite state transducer, a rule engine for suffixation, a lexicon, a trie data structure, and an LRU cache. We use Java to implement the finite state machine logic and the rule engine, and XML to describe the finite state transducer rules of Turkish, which makes the analyzer both easily extendible and easily applicable to other languages. With a comprehensive lexicon of 54,000 bare forms, including 19,000 proper nouns, our morphological analyzer is one of the most reliable analyzers produced so far. The analyzer is compared with Turkish morphological analyzers in the literature. By using an LRU cache and a trie data structure, the system can analyze 100,000 words per second, which enables users to analyze huge corpora in a few hours.
Database: OpenAIRE
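
Illustration: The description credits the analyzer's throughput to a trie data structure combined with an LRU cache, but this record gives no implementation details. The Java sketch below shows one way the two could fit together: a trie that lists every lexicon entry occurring as a prefix of a surface form, with a small LRU cache in front of it. The class name RootCandidateFinder, its methods, and the sample words are hypothetical and are not taken from the authors' code. The sketch also assumes plain prefix matching and ignores stem changes such as consonant softening, which a real Turkish analyzer has to handle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: trie-backed root lookup with an LRU cache in front. */
class RootCandidateFinder {

    /** One trie node; children are keyed by the next character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endsRoot; // true if the path from the trie root spells a lexicon entry
    }

    private final Node trieRoot = new Node();
    private final Map<String, List<String>> cache;

    RootCandidateFinder(final int cacheCapacity) {
        // accessOrder = true keeps entries ordered from least to most recently
        // accessed, so dropping the eldest entry yields LRU eviction.
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    /** Adds a bare form to the trie and drops cached results that may now be stale. */
    void addRoot(String root) {
        Node node = trieRoot;
        for (char c : root.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endsRoot = true;
        cache.clear();
    }

    /** Returns every lexicon entry that is a prefix of the surface form, shortest first. */
    List<String> candidateRoots(String surfaceForm) {
        List<String> cached = cache.get(surfaceForm);
        if (cached != null) {
            return cached; // cache hit: the trie walk is skipped
        }
        List<String> found = new ArrayList<>();
        Node node = trieRoot;
        for (int i = 0; i < surfaceForm.length(); i++) {
            node = node.children.get(surfaceForm.charAt(i));
            if (node == null) {
                break; // no lexicon entry continues with this character
            }
            if (node.endsRoot) {
                found.add(surfaceForm.substring(0, i + 1));
            }
        }
        cache.put(surfaceForm, found);
        return found;
    }

    /** Number of surface forms currently held in the cache. */
    int cachedEntries() {
        return cache.size();
    }
}
```

With "el" and "elma" in the lexicon, candidateRoots("elmalar") returns both entries, and each candidate would then be handed to the suffixation rules. Repeated surface forms, which are frequent in running text, are answered from the cache without walking the trie again.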