An open, extendible, and fast Turkish morphological analyzer

Authors:	Begüm Avar, Olcay Taner Yıldız, Gökhan Ercan
Contributors:	Işık Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü; Işık University, Faculty of Engineering, Department of Computer Engineering; Yıldız, Olcay Taner; Ercan, Gökhan
Language:	English
Year of publication:	2019
Subject:	Rule engine; 050101 languages & linguistics; Spectrum analyzer; Java language; Trie data structures; Data structures; Proper nouns; XML languages; computer.internet_protocol; Computer science; Transducers; Text processing; Computational linguistics; 02 engineering and technology; Speech recognition; computer.software_genre; Lexicon; Turkish language; Trie; 0202 electrical engineering, electronic engineering, information engineering; 0501 psychology and cognitive sciences; Engines; Cache algorithms; Finite-state machine; business.industry; 05 social sciences; Deep learning; Data structure; Finite state transducers; Natural language processing systems; Semantics; Morphological analyzer; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; XML; Natural language processing
Source:	RANLP
Description:	In this paper, we present a two-level morphological analyzer for Turkish. The analyzer consists of five main components: a finite state transducer, a rule engine for suffixation, a lexicon, a trie data structure, and an LRU cache. We use Java to implement the finite state machine logic and the rule engine, and XML to describe the finite state transducer rules of Turkish, which makes the analyzer both easily extendible and easily applicable to other languages. With a comprehensive lexicon of 54,000 bare forms, including 19,000 proper nouns, our morphological analyzer is one of the most reliable analyzers produced so far. We compare the analyzer with other Turkish morphological analyzers in the literature. By using an LRU cache and a trie data structure, the system can analyze 100,000 words per second, which enables users to analyze huge corpora in a few hours.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d1a2d28638193410ce6ff1c840848264 https://hdl.handle.net/11729/2300
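
Illustrative sketch:	The description above names a trie and an LRU cache as the two components behind the reported throughput. The following Java sketch shows one way such a pairing could work: a trie returns every lexicon entry that is a prefix of a surface form (candidate roots), and a small LRU cache built on `LinkedHashMap` avoids repeating the lookup for frequent words. All class and method names here (`LexiconTrie`, `LruCache`, `CachedRootFinder`, `candidateRoots`) are hypothetical and are not taken from the analyzer itself; the sketch also omits the finite state transducer and suffixation rules entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trie over lexicon bare forms (not the analyzer's actual class).
class LexiconTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Inserts one bare form, creating nodes along its character path.
    void add(String bareForm) {
        Node node = root;
        for (char c : bareForm.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every lexicon entry that is a prefix of the surface form.
    List<String> candidateRoots(String surface) {
        List<String> roots = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < surface.length(); i++) {
            node = node.children.get(surface.charAt(i));
            if (node == null) {
                break; // no lexicon entry continues along this path
            }
            if (node.isWord) {
                roots.add(surface.substring(0, i + 1));
            }
        }
        return roots;
    }
}

// Minimal LRU cache: an access-ordered LinkedHashMap that evicts its eldest entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

// Consults the cache first and falls back to the trie on a miss.
class CachedRootFinder {
    private final LexiconTrie trie;
    private final LruCache<String, List<String>> cache;
    int trieLookups = 0; // counts cache misses, for illustration only

    CachedRootFinder(LexiconTrie trie, int cacheSize) {
        this.trie = trie;
        this.cache = new LruCache<>(cacheSize);
    }

    List<String> roots(String surface) {
        List<String> cached = cache.get(surface);
        if (cached != null) {
            return cached;
        }
        trieLookups++;
        List<String> result = trie.candidateRoots(surface);
        cache.put(surface, result);
        return result;
    }
}
```

With "ev" (house) in the lexicon, asking for the roots of "evlerde" (in the houses) would return "ev" as a candidate, and a repeated query for the same word would be served from the cache without touching the trie. A real analyzer would still need to check that the remaining suffix sequence is valid for each candidate root.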