Modeling Topics in DFA-Based Lemmatized Gujarati Text.

Autor: Chauhan U; Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India., Shah S; Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India., Shiroya D; Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India., Solanki D; Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India., Patel Z; Department of Computer Engineering, Vishwakarma Government Engineering College, Chandkheda, Ahmedabad 382424, India., Bhatia J; Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, India., Tanwar S; Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, India., Sharma R; Ravi Sharma, Centre for Inter-Disciplinary Research and Innovation, University of Petroleum and Energy Studies, Dehradun 248001, India., Marina V; Faculty of Civil Engineering and Building Services, Department of Building Services, Technical University of Gheorghe Asachi, 700050 Iasi, Romania., Raboaca MS; Doctoral School, University Politehnica of Bucharest, Splaiul Independentei Street No. 313, 060042 Bucharest, Romania.; National Research and Development Institute for Cryogenic and Isotopic Technologies-ICSI Rm. Vâlcea, Uzinei Street, No. 4, P.O. Box 7 Râureni, 240050 Râmnicu Vâlcea, Romania.
Jazyk: angličtina
Zdroj: Sensors (Basel, Switzerland) [Sensors (Basel)] 2023 Mar 01; Vol. 23 (5). Date of Electronic Publication: 2023 Mar 01.
DOI: 10.3390/s23052708
Abstrakt: Topic modeling is a machine learning algorithm based on statistics that follows unsupervised machine learning techniques for mapping a high-dimensional corpus to a low-dimensional topical subspace, but it could be better. A topic model's topic is expected to be interpretable as a concept, i.e., correspond to human understanding of a topic occurring in texts. While discovering corpus themes, inference constantly uses vocabulary that impacts topic quality due to its size. Inflectional forms are in the corpus. Since words frequently appear in the same sentence and are likely to have a latent topic, practically all topic models rely on co-occurrence signals between various terms in the corpus. The topics get weaker because of the abundance of distinct tokens in languages with extensive inflectional morphology. Lemmatization is often used to preempt this problem. Gujarati is one of the morphologically rich languages, as a word may have several inflectional forms. This paper proposes a deterministic finite automaton (DFA) based lemmatization technique for the Gujarati language to transform lemmas into their root words. The set of topics is then inferred from this lemmatized corpus of Gujarati text. We employ statistical divergence measurements to identify semantically less coherent (overly general) topics. The result shows that the lemmatized Gujarati corpus learns more interpretable and meaningful subjects than unlemmatized text. Finally, results show that lemmatization curtails the size of vocabulary decreases by 16% and the semantic coherence for all three measurements-Log Conditional Probability, Pointwise Mutual Information, and Normalized Pointwise Mutual Information-from -9.39 to -7.49, -6.79 to -5.18, and -0.23 to -0.17, respectively.
Databáze: MEDLINE
Nepřihlášeným uživatelům se plný text nezobrazuje