Authors: |
Nikhil V. Chandran, V.S. Anoop, S. Asharaf |
Language: |
English |
Publication year: |
2023 |
Source: |
Results in Engineering, Vol 17, Iss , Pp 100949- (2023) |
Document type: |
article |
ISSN: |
2590-1230 |
DOI: |
10.1016/j.rineng.2023.100949 |
Description: |
Topic models are unsupervised machine learning techniques that output clusters of “topics” represented as co-occurring words with their associated probability distributions. Topic modeling algorithms find latent themes in large document collections by understanding their context. String kernels, on the other hand, are supervised machine learning techniques that quantify string similarity without explicit string encoding. We propose TopicStriKer, a model combining the advantages of unsupervised topic modeling with supervised string kernels for text classification tasks. The co-occurring topic words for each topic and the topic proportions for each document are used to reduce the document corpus to a topic-word sequence. This reduced representation is then used for text classification with the aid of string kernels, significantly improving accuracy and reducing training time. In experiments on bag-of-words kernel-based string embeddings, the proposed algorithm outperforms traditional text classification approaches. This work extensively compares string kernels with topic modeling on various performance metrics to establish our findings. |
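The reduction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the topic words, topic proportions, the `min_proportion` threshold, and all function names below are hypothetical, and the topic model output is hard-coded rather than learned. The sketch assumes a document is reduced by keeping only the tokens that belong to its dominant topics, and that similarity is measured with a normalized bag-of-words kernel.

```python
from collections import Counter
from math import sqrt

# Hypothetical topic model output (for example from LDA); made up for illustration.
TOPIC_WORDS = {
    0: {"kernel", "string", "classifier"},
    1: {"topic", "latent", "corpus"},
}


def reduce_document(tokens, topic_proportions, topic_words=TOPIC_WORDS,
                    min_proportion=0.3):
    """Keep only tokens that are top words of the document's dominant topics.

    `min_proportion` is an assumed cut-off, not a value from the paper.
    """
    active = {t for t, p in topic_proportions.items() if p >= min_proportion}
    vocabulary = set().union(*(topic_words[t] for t in active))
    return [tok for tok in tokens if tok in vocabulary]


def bow_kernel(seq_a, seq_b):
    """Normalized bag-of-words kernel: cosine of the two word-count vectors."""
    counts_a, counts_b = Counter(seq_a), Counter(seq_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm = (sqrt(sum(v * v for v in counts_a.values()))
            * sqrt(sum(v * v for v in counts_b.values())))
    return dot / norm if norm else 0.0


def gram_matrix(sequences):
    """Pairwise kernel values, the input a kernel classifier would consume."""
    return [[bow_kernel(a, b) for b in sequences] for a in sequences]


# Toy corpus: (tokens, hypothetical topic proportions per document).
corpus = [
    ("the string kernel compares each string".split(), {0: 0.8, 1: 0.2}),
    ("a latent topic summarizes the corpus".split(), {0: 0.1, 1: 0.9}),
    ("a kernel classifier for string data".split(), {0: 0.7, 1: 0.3}),
]
reduced = [reduce_document(tokens, props) for tokens, props in corpus]
gram = gram_matrix(reduced)
```

The resulting Gram matrix could then be passed to any classifier that accepts a precomputed kernel, such as a support vector machine. Working on the reduced sequences rather than the full documents is what the abstract credits for the shorter training time.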
Database: |
Directory of Open Access Journals |