Combination of Bayesian and Latent Semantic Analysis with Domain Specific Knowledge

Author: Shen Lu, Richard S. Segall
Language: English
Year of publication: 2016
Subject:
Source: Journal of Systemics, Cybernetics and Informatics, Vol 14, Iss 3, Pp 43-50 (2016)
Document type: article
ISSN: 1690-4524
Description: With the development of information technology, electronic publications have become popular. However, retrieving information from electronic publications is a challenge because of the large number of words, the synonymy problem and the polysemy problem. In this paper, we introduced a new algorithm called Bayesian Latent Semantic Analysis (BLSA). We chose to model text based not on terms but on associations between words. In addition, the significance of interesting features was improved by expanding the number of similar terms with glossaries. Latent Semantic Analysis (LSA) was chosen to discover significant features. Bayesian posterior probability was used to discover segmentation boundaries. A Dirichlet distribution was chosen to represent the vector of topic distributions and to calculate the maximum probability of the topics. Experimental results showed that both Pk [8] and WindowDiff [27] decreased by 10% when using BLSA, compared with Lexical Cohesion on the original data. [8] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407. [27] Pevzner, L. and Hearst, M.A. (2002). A critique and improvement of an evaluation metric for text segmentation, Computational Linguistics, vol. 28, no. 1, pp. 19-36.
Database: Directory of Open Access Journals
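
Note: The description reports results in terms of Pk and WindowDiff, two error metrics for text segmentation; lower values mean the predicted boundaries are closer to the reference, which is why a decrease counts as an improvement. The following is a minimal sketch of how these two metrics are commonly computed, not the authors' evaluation code. The function names, the per-unit segment-label representation, and the default window size (half the average reference segment length) are assumptions made for illustration.

```python
def boundaries(labels):
    """Return a 0/1 list; entry j is 1 if a boundary lies between units j and j+1."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]


def default_window(ref_labels):
    """Assumed convention: half the average reference segment length, at least 1."""
    n_segments = 1 + sum(boundaries(ref_labels))
    return max(1, round(len(ref_labels) / (2 * n_segments)))


def pk(ref_labels, hyp_labels, k=None):
    """Fraction of windows where reference and hypothesis disagree on whether
    units i and i+k belong to the same segment."""
    k = k or default_window(ref_labels)
    ref_b, hyp_b = boundaries(ref_labels), boundaries(hyp_labels)
    n_windows = len(ref_labels) - k
    errors = 0
    for i in range(n_windows):
        same_in_ref = sum(ref_b[i:i + k]) == 0
        same_in_hyp = sum(hyp_b[i:i + k]) == 0
        errors += same_in_ref != same_in_hyp
    return errors / n_windows


def window_diff(ref_labels, hyp_labels, k=None):
    """Fraction of windows where the number of boundaries between units i and
    i+k differs between reference and hypothesis."""
    k = k or default_window(ref_labels)
    ref_b, hyp_b = boundaries(ref_labels), boundaries(hyp_labels)
    n_windows = len(ref_labels) - k
    errors = 0
    for i in range(n_windows):
        errors += sum(ref_b[i:i + k]) != sum(hyp_b[i:i + k])
    return errors / n_windows


if __name__ == "__main__":
    # Hypothetical toy data: one label per sentence, equal labels = same segment.
    reference = [0, 0, 0, 0, 1, 1, 1, 1]
    hypothesis = [0, 0, 0, 1, 2, 2, 2, 2]
    print("Pk:", pk(reference, hypothesis))
    print("WindowDiff:", window_diff(reference, hypothesis))
```

In this sketch WindowDiff counts every window whose boundary count differs, so it also penalizes a hypothesis that places extra boundaries inside a window that already contains a reference boundary, a case Pk does not count as an error.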