An Improved LDA Topic Modeling Method Based on Partition for Medium and Long Texts
Author: Menglin Lu, Wei Wei, Chonghui Guo
Year of publication: 2019
Subject: Topic model; Latent Dirichlet allocation; Computer science; Artificial intelligence; Natural language processing; Pattern recognition; Granularity; Partition
Source: Annals of Data Science, 8: 331-344
ISSN: 2198-5812, 2198-5804
DOI: 10.1007/s40745-019-00218-3
Description: Latent Dirichlet Allocation (LDA) is a topic model that represents each document as a distribution over topics and each topic as a distribution over words, mining the semantic relationships hidden in text. However, traditional LDA ignores semantic features embedded in the internal structure of medium and long documents. Rather than modeling topics at the document level with the original LDA, it is better to refine each document into distinct semantic topic units. In this paper, we propose an improved LDA topic model based on partition (LDAP) for medium and long texts. LDAP preserves the benefits of the original LDA while refining the modeling granularity from the document level to the semantic topic level, which makes it particularly suitable for topic modeling of medium and long texts. Extensive classification experiments on the Fudan University corpus and the Sogou Lab corpus demonstrate that LDAP outperforms other topic models such as LDA, HDP, LSA and doc2vec. (A minimal code sketch of the partition idea follows this record.)
Database: OpenAIRE
External link:
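The abstract explains the core move of LDAP: instead of fitting LDA over whole documents, refine each medium or long document into smaller semantic units, model topics at that finer granularity, and aggregate back to a document-level representation. The record does not spell out the authors' partitioning or aggregation procedure, so the following is only a minimal sketch of the general partition-then-model idea, assuming paragraph boundaries as the unit of partition, scikit-learn's `LatentDirichletAllocation` as the underlying topic model, and a simple mean for aggregation; none of these choices should be read as the paper's actual method.

```python
# A minimal sketch of the partition idea described in the abstract, NOT the
# authors' LDAP implementation: paragraphs stand in for the "semantic topic
# units", scikit-learn's LDA is the underlying topic model, and a simple
# mean aggregates segment-level distributions back to the document level.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "Topic models mine latent themes from text.\n\n"
    "LDA represents each document as a mixture of topics.",
    "Medium and long documents often span several themes.\n\n"
    "Partitioning them exposes finer-grained semantic units.",
]

# 1) Partition: treat each paragraph as a pseudo-document, remembering
#    which original document it came from.
segments, parent = [], []
for d, doc in enumerate(docs):
    for para in (p.strip() for p in doc.split("\n\n")):
        if para:
            segments.append(para)
            parent.append(d)

# 2) Fit ordinary LDA at the segment level, i.e. at a finer granularity
#    than the document level.
X = CountVectorizer().fit_transform(segments)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
seg_topics = lda.fit_transform(X)  # one topic distribution per segment

# 3) Aggregate segment distributions into a document-level representation
#    (a plain mean here; the paper's aggregation scheme may differ).
parent = np.array(parent)
doc_topics = np.vstack([seg_topics[parent == d].mean(axis=0)
                        for d in range(len(docs))])
print(doc_topics.round(3))
```

Training at the segment level is what shifts the granularity from the document level toward the semantic topic level the abstract describes; feeding the resulting document vectors to a classifier, as in the paper's experiments on the Fudan and Sogou corpora, would be the natural way to evaluate such a sketch.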