Comparing Different Term Weighting Schemas for Topic Modeling
Authors: | Florin Radulescu, Alexandru Boicea, Ciprian-Octavian Truica |
---|---|
Publication year: | 2016 |
Subject: | Topic model; Text corpus; Computer science; Rand index; Statistical model; Machine learning; Latent Dirichlet allocation; Weighting; Matrix decomposition; Artificial intelligence; Data mining; Cluster analysis; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020204 information systems; 020201 artificial intelligence & image processing |
Source: | SYNASC |
DOI: | 10.1109/synasc.2016.055 |
Description: | Topic modeling is a statistical modeling technique that tries to determine the topics present in a corpus of documents. The accuracy measures applied to clustering algorithms can also be used to assess the accuracy of topic modeling algorithms, because determining topics for documents is similar to clustering them. This paper presents an experimental validation of the accuracy of Latent Dirichlet Allocation in comparison with Non-Negative Matrix Factorization and K-Means. The experiments use different weighting schemas when constructing the document-term matrix to determine whether the accuracy of the algorithms improves. Two well-known, already labeled text corpora are used for testing. Purity and the Adjusted Rand Index are used to evaluate accuracy. A comparison of the run-time of these algorithms is also presented. |
Database: | OpenAIRE |
External link: |