Unsupervised Lexicon-Based Stemming by Dual Dictionary Models

Authors:	Aysegul Karcili, Can Ozbey
Year of publication:	2021
Subject:	business.industry; Computer science; Pattern recognition; Data_CODINGANDINFORMATIONTHEORY; Lexicon; Prefix; Expectation–maximization algorithm; Segmentation; Artificial intelligence; Suffix; Greedy algorithm; business; Time complexity; Word (computer architecture)
Source:	2021 Innovations in Intelligent Systems and Applications Conference (ASYU).
DOI:	10.1109/asyu52992.2021.9599035
Description:	In this paper, we present an unsupervised statistical method for detecting stems given a large set of words. The idea is to find the segmentation point within a word that yields the most likely prefix-suffix pair, using dynamic dual dictionaries. We initialize the prefix and suffix dictionaries with a greedy heuristic based on two-way lookup, and then iteratively apply expectation maximization to update them with the probabilities of all prefix-suffix pairs. We also provide a nonparametric technique for scaling suffix probabilities relative to prefix probabilities by minimizing the sum of squared word reconstruction errors. We evaluate the method, alongside another unsupervised morphological segmentation model, Morfessor, on manually labeled test sets drawn from a large collection of words obtained from Turkish Wikipedia. The proposed model was found to detect correct stems with higher accuracy and lower time complexity.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::5f145aa45ef6f053a8a55900b22e1542 https://doi.org/10.1109/asyu52992.2021.9599035
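
The core idea in the description (choosing the split point whose prefix-suffix pair is most likely, with both dictionaries re-estimated by expectation maximization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (splits, train, stem) are hypothetical, the dictionaries are initialized uniformly over all splits rather than by the greedy two-way lookup heuristic, and the suffix probability scaling step is omitted.

```python
from collections import Counter


def splits(word):
    """All (prefix, suffix) pairs with a non-empty prefix; the suffix may be empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]


def normalize(counts):
    """Turn expected counts into a probability dictionary."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train(words, iterations=10):
    """Estimate prefix and suffix dictionaries with a simple EM loop."""
    # Assumption: uniform initialization over every split of every word.
    # The paper instead uses a greedy heuristic based on two-way lookup.
    prefix_counts, suffix_counts = Counter(), Counter()
    for word in words:
        candidates = splits(word)
        for prefix, suffix in candidates:
            prefix_counts[prefix] += 1.0 / len(candidates)
            suffix_counts[suffix] += 1.0 / len(candidates)
    p_prob, s_prob = normalize(prefix_counts), normalize(suffix_counts)

    for _ in range(iterations):
        prefix_counts, suffix_counts = Counter(), Counter()
        for word in words:
            candidates = splits(word)
            # E-step: posterior weight of each split under the current dictionaries.
            scores = [p_prob[p] * s_prob[s] for p, s in candidates]
            z = sum(scores)
            for (prefix, suffix), score in zip(candidates, scores):
                prefix_counts[prefix] += score / z
                suffix_counts[suffix] += score / z
        # M-step: re-estimate both dictionaries from the expected counts.
        p_prob, s_prob = normalize(prefix_counts), normalize(suffix_counts)
    return p_prob, s_prob


def stem(word, p_prob, s_prob):
    """Return the prefix of the most likely prefix-suffix split."""
    best = max(
        splits(word),
        key=lambda ps: p_prob.get(ps[0], 0.0) * s_prob.get(ps[1], 0.0),
    )
    return best[0]
```

A call such as p_prob, s_prob = train(word_list) followed by stem("evler", p_prob, s_prob) returns some prefix of the input word. Which prefix wins depends heavily on the training vocabulary, and without the scaling step described above the unsplit word may be favored.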