Unsupervised Lexicon-Based Stemming by Dual Dictionary Models

Authors: Aysegul Karcili, Can Ozbey
Year of publication: 2021
Subject:
Source: 2021 Innovations in Intelligent Systems and Applications Conference (ASYU).
DOI: 10.1109/asyu52992.2021.9599035
Description: In this paper, we present an unsupervised statistical method for detecting stems given a large set of words. The idea is to find the segmentation point within a word that yields the most likely prefix-suffix pair using dynamic dual dictionaries. We initialize the prefix and suffix dictionaries through a greedy heuristic based on two-way lookup, and then iteratively apply expectation maximization to update them with the corresponding probabilities of all prefix-suffix pairs. We also provide a nonparametric technique for scaling suffix probabilities relative to prefix probabilities by minimizing the sum of squared word reconstruction errors. The method is evaluated, alongside another unsupervised morphological segmentation model, Morfessor, on manually labeled test sets drawn from a large collection of words obtained from Turkish Wikipedia. In this comparison, the proposed model was found to have higher accuracy in detecting correct stems, with lower time complexity.
Database: OpenAIRE
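
Illustration: the segmentation idea in the description (pick the split point whose prefix-suffix pair is most likely under two dictionaries, then re-estimate the dictionaries) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names best_split and reestimate, the floor value for unseen entries, the toy word list, and the starting probabilities are all hypothetical. The re-estimation step uses hard assignments (each word contributes only its single best split), which is a simplification swapped in for the expectation-maximization update over all prefix-suffix pairs, and the suffix-probability scaling step is omitted.

```python
from collections import Counter


def best_split(word, prefix_prob, suffix_prob, floor=1e-9):
    """Return the (prefix, suffix) pair with the highest product of probabilities.

    Entries missing from either dictionary get the small `floor` probability
    (an assumed smoothing choice, not taken from the paper).
    """
    best, best_score = (word, ""), -1.0
    for i in range(1, len(word) + 1):
        prefix, suffix = word[:i], word[i:]
        score = prefix_prob.get(prefix, floor) * suffix_prob.get(suffix, floor)
        if score > best_score:
            best, best_score = (prefix, suffix), score
    return best


def reestimate(words, prefix_prob, suffix_prob):
    """One hard-assignment pass: split every word, then recount frequencies."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for word in words:
        prefix, suffix = best_split(word, prefix_prob, suffix_prob)
        prefix_counts[prefix] += 1
        suffix_counts[suffix] += 1
    total = len(words)
    new_prefix = {p: c / total for p, c in prefix_counts.items()}
    new_suffix = {s: c / total for s, c in suffix_counts.items()}
    return new_prefix, new_suffix


# Toy Turkish data (hypothetical): ev+ler, ev+de, okul+lar, okul+da
words = ["evler", "evde", "okullar", "okulda"]
prefix_prob = {"ev": 0.5, "okul": 0.5}
suffix_prob = {"ler": 0.25, "de": 0.25, "lar": 0.25, "da": 0.25}

stems = [best_split(w, prefix_prob, suffix_prob)[0] for w in words]
new_prefix, new_suffix = reestimate(words, prefix_prob, suffix_prob)
```

In this toy setup the starting dictionaries already contain the intended stems and suffixes, so the pass simply confirms them; with a greedy initialization over a large word list, repeated passes would be expected to shift probability mass between competing splits.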