Text Categorization with Latent Dirichlet Allocation

Authors: ZLACKÝ Daniel, STAŠ Ján, JUHÁR Jozef, CIŽMÁR Anton
Language: English
Year of publication: 2014
Subject:
Source: Journal of Electrical and Electronics Engineering, Vol 7, Iss 1, Pp 161-164 (2014)
Document type: article
ISSN: 1844-6035
2067-2128
Description: This paper focuses on the categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar documents, and to use these better-organized subcorpora to build more robust language models for speech recognition systems. Our previous research on text categorization showed that we can achieve better results with categorized text corpora. We divided the initial text corpus into 2, 5, 10, 20 or 100 subcorpora, using varying numbers of iterations and save steps. Language models were built on these subcorpora and adapted to the judicial domain with linear interpolation. The experimental results showed that text categorization using latent Dirichlet allocation can improve an automatic speech recognition system by building the language models from organized text corpora.
Database: Directory of Open Access Journals
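
Illustrative note: the abstract mentions adapting language models to the judicial domain with linear interpolation. The sketch below shows the general idea only, under simplifying assumptions that are not taken from the paper: it uses unigram models rather than full n-gram models, the helper names `unigram_model` and `interpolate` are hypothetical, and the toy sentences and the 0.7/0.3 weights are made up for illustration.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(models, weights):
    """Linearly interpolate models: P(w) = sum_i weights[i] * P_i(w)."""
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Union of all vocabularies; a word missing from a model contributes 0.
    vocabulary = set().union(*models)
    return {
        word: sum(w * m.get(word, 0.0) for m, w in zip(models, weights))
        for word in vocabulary
    }


# Hypothetical toy data (not from the paper): one model trained on a
# topic subcorpus and one trained on in-domain judicial text.
subcorpus_model = unigram_model("the court ruled the appeal".split())
domain_model = unigram_model("the judge ruled".split())

# Assumed interpolation weights; in practice they would be tuned on
# held-out in-domain data.
adapted = interpolate([subcorpus_model, domain_model], [0.7, 0.3])
```

Because each input model is a probability distribution and the weights sum to 1, the interpolated model is again a valid distribution over the combined vocabulary.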