TopicStriKer: A topic kernels-powered approach for text classification

Authors: Nikhil V. Chandran, V.S. Anoop, S. Asharaf
Language: English
Publication year: 2023
Subject:
Source: Results in Engineering, Vol 17, Iss , Pp 100949- (2023)
Document type: article
ISSN: 2590-1230
DOI: 10.1016/j.rineng.2023.100949
Description: Topic models are unsupervised machine learning techniques that output clusters of “topics”, each represented as a set of co-occurring words with associated probability distributions. Topic modeling algorithms find latent themes in large document collections by understanding their context. String kernels, on the other hand, are supervised machine learning techniques that quantify string similarities without explicit string encoding. We propose TopicStriKer, a model that combines the advantages of unsupervised topic modeling with supervised string kernels for text classification tasks. The co-occurring topic words per topic and the topic proportions per document are used to reduce the document corpus to a topic-word sequence. This reduced representation is then used for text classification with the aid of string kernels, significantly improving accuracy and reducing training time. In experiments, bag-of-words kernel-based string embeddings produced by the proposed algorithm outperform traditional text classification approaches. This work extensively compares string kernels with topic modeling on various performance metrics to establish our findings.
Database: Directory of Open Access Journals
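
Illustration: The pipeline outlined in the description (reduce each document to a topic-word sequence, then classify with a kernel) can be sketched roughly as follows. This is a minimal toy sketch, not the authors' implementation. It assumes the topic model has already been fitted, so the per-topic word lists and per-document topic proportions are hard-coded, hypothetical values. It also assumes a cosine-normalized bag-of-words kernel, and it swaps in a 1-nearest-neighbour rule as a stand-in for whatever kernel classifier the paper uses. All function names and parameters (`reduce_to_topic_words`, `bow_kernel`, `predict`, `top_topics`, `top_words`) are made up for this example.

```python
from collections import Counter
from math import sqrt


def reduce_to_topic_words(doc_topic, topic_words, top_topics=1, top_words=3):
    """Replace a document by the leading words of its most probable topics.

    doc_topic   -- topic proportions for one document (assumed given)
    topic_words -- per-topic word lists, sorted by descending probability
    """
    ranked = sorted(range(len(doc_topic)), key=lambda t: doc_topic[t], reverse=True)
    sequence = []
    for t in ranked[:top_topics]:
        sequence.extend(topic_words[t][:top_words])
    return sequence


def bow_kernel(seq_a, seq_b):
    """Cosine-normalized bag-of-words kernel between two word sequences."""
    counts_a, counts_b = Counter(seq_a), Counter(seq_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm = sqrt(
        sum(v * v for v in counts_a.values())
        * sum(v * v for v in counts_b.values())
    )
    return dot / norm if norm else 0.0


def predict(query_seq, train_seqs, train_labels):
    """Label of the most similar training sequence under the kernel.

    Assumption: 1-nearest-neighbour is a simple stand-in, not the paper's classifier.
    """
    scores = [bow_kernel(query_seq, s) for s in train_seqs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]


# Hypothetical topic-model output (would normally come from a fitted topic model).
TOPIC_WORDS = [
    ["goal", "match", "team", "league"],
    ["stock", "market", "price", "trade"],
]
TRAIN_DOC_TOPIC = [[0.9, 0.1], [0.2, 0.8]]
TRAIN_LABELS = ["sports", "finance"]

train_seqs = [reduce_to_topic_words(p, TOPIC_WORDS) for p in TRAIN_DOC_TOPIC]
query_seq = reduce_to_topic_words([0.3, 0.7], TOPIC_WORDS)
print(predict(query_seq, train_seqs, TRAIN_LABELS))  # prints "finance"
```

Because each document shrinks to a handful of topic words, the kernel is evaluated on very short sequences. That is one plausible reading of why the description reports reduced training time, though the sketch itself does not demonstrate that result.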