Use Context Information to Improve the Performance of Latent Dirichlet Allocation
Author: | Che-Yi Lin, 林哲毅 |
---|---|
Year of publication: | 2014 |
Document type: | Thesis (學位論文) |
Description: | Latent Dirichlet Allocation (LDA) is a widely used topic model for discovering the topics in documents; however, it suffers from several problems, such as the lack of dependency between words and data sparsity. A main cause of these problems is word-sense ambiguity in natural language. Previous works drop the "bag of words" assumption and add dependencies between words. We take a different approach: to address these problems, we propose a topic model called the context LDA (CLDA) model. The CLDA model first builds concept vectors from the context information at each position and uses these vectors to identify equivalence relationships between words; it then models the words into latent topics with a topic model that takes these relationships as input. The CLDA model not only overcomes the word-sense ambiguity problem but can also be easily parallelized and extended. With some extra knowledge and a slight modification, we show that our model can also handle the sparse-data problem. We conduct several experiments on the 20 Newsgroups dataset; the results show that our model improves the performance of the original LDA and fixes the imbalanced-topic problem by using the vectors and equivalence relationships. Finally, we show examples of the latent topics produced by the original LDA model and by our model. |
Database: | Networked Digital Library of Theses & Dissertations |
External link: |
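The abstract's first step — building a context vector for each word and treating words with near-identical contexts as equivalent — could be sketched roughly as follows. This is a minimal illustration, not the thesis's actual algorithm: the window size, the cosine-similarity test, and the 0.8 threshold are all assumptions of this sketch.

```python
# Illustrative sketch (not the thesis's implementation): represent each
# word by the counts of words appearing within a small window around its
# occurrences, then flag word pairs whose context vectors are nearly
# parallel (high cosine similarity) as candidate "equivalent" words.
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(docs, window=2):
    """Map each word to a Counter of words seen within +/- window of it."""
    vecs = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][doc[j]] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def equivalent_pairs(vecs, threshold=0.8):
    """Return word pairs whose context similarity exceeds the threshold."""
    words = sorted(vecs)
    return [(u, v)
            for i, u in enumerate(words)
            for v in words[i + 1:]
            if cosine(vecs[u], vecs[v]) >= threshold]


# Tiny example: "cat" and "dog" occur in identical contexts, so they
# come out as an equivalent pair; such pairs would then be merged or
# tied together before (or inside) the topic-modelling step.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vecs = context_vectors(docs)
pairs = equivalent_pairs(vecs)  # → [("cat", "dog")]
```

In the full model, these equivalence relationships would feed into the topic inference itself rather than being a simple preprocessing merge, but the sketch conveys the intuition: words that are interchangeable in context should end up in the same latent topics.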