Comparing Different Term Weighting Schemas for Topic Modeling

Authors: Florin Radulescu, Alexandru Boicea, Ciprian-Octavian Truica
Year of publication: 2016
Subject:
Source: SYNASC
DOI: 10.1109/synasc.2016.055
Description: Topic modeling is a type of statistical modeling that tries to determine the topics present in a corpus of documents. The accuracy measures applied to clustering algorithms can also be used to assess topic modeling algorithms, because determining the topics of documents is similar to clustering them. This paper presents an experimental validation of the accuracy of Latent Dirichlet Allocation in comparison with Non-Negative Matrix Factorization and K-Means. The experiments use different term weighting schemas when constructing the document-term matrix to determine whether the accuracy of the algorithms improves. Two well-known, already labeled text corpora are used for testing. Purity and the Adjusted Rand Index are used to evaluate accuracy. A comparison of the run-time of these algorithms is also presented.
Database: OpenAIRE
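
The description names Purity and the Adjusted Rand Index as the evaluation measures but does not define them. The following is a minimal sketch of the standard textbook definitions of both measures, not the authors' implementation. The function names and the toy labels are hypothetical and chosen only for illustration. It assumes each document has one true label and one predicted cluster (for a topic model, for example, the document's dominant topic).

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance, computed from the contingency table."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: treated as trivially perfect
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(n, 2) for n in cells.values())
    sum_rows = sum(comb(n, 2) for n in Counter(labels_true).values())
    sum_cols = sum(comb(n, 2) for n in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate partitions, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy data (hypothetical): six documents, two true classes, two predicted clusters.
example_true = ["sport", "sport", "sport", "politics", "politics", "politics"]
example_pred = [0, 0, 1, 1, 1, 1]

print(purity(example_true, example_pred))
print(adjusted_rand_index(example_true, example_pred))
```

Purity rewards clusterings with many small clusters, while the Adjusted Rand Index corrects for chance agreement, which is one reason the two measures are often reported together.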