Document indexing in text categorization

Authors:	Jing-Hua Tan, Shoubin Dong, Qi-Rui Zhang, Ling Zhang
Year of publication:	2005
Subject:	Vocabulary; Computer science; Search engine indexing; Feature selection; Function (mathematics); Boosting methods for object categorization; Term (time); Text mining; Categorization; Artificial intelligence; tf–idf; Natural language processing
Source:	2005 International Conference on Machine Learning and Cybernetics.
DOI:	10.1109/icmlc.2005.1527600
Description:	To suit the characteristics of text categorization, this paper proposes tfidfie, an improved method of computing term weights based on the traditional tfidf function used in most classifiers. Compared with tfidf, the tfidfie function adds an information entropy factor, H, which represents the distribution of the training-set documents in which the term occurs. Experiments show that tfidfie outperforms tfidf. The paper also analyses how the use of the information entropy factor H differs between document categorization and feature selection, and finds that both phases are necessary for text categorization. The best classification performance can be reached with up to 70% of the unique terms removed.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f6425f9f741e490fca3939c3e5e6d949 https://doi.org/10.1109/icmlc.2005.1527600
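
Illustrative note: the description above says that tfidfie extends tfidf with an information entropy factor H, but this record does not give the formula. The following is only a minimal sketch of the general idea, not the authors' method. It assumes that H is the Shannon entropy of a term's document counts across categories, normalized to [0, 1], and that it damps the weight through a factor (1 - H). The function names, the category-based interpretation, and the way H is combined with tfidf are all hypothetical choices made for illustration.

```python
import math


def entropy_factor(doc_counts_per_category):
    """Normalized Shannon entropy of a term's document counts over categories.

    ASSUMPTION: the record only says H "represents the distribution of
    documents in the training set in which the term occurs"; treating that
    distribution as per-category document counts is a guess.
    Returns 0.0 when the term is confined to one category and 1.0 when it
    is spread evenly over all categories.
    """
    total = sum(doc_counts_per_category)
    k = len(doc_counts_per_category)
    if total == 0 or k < 2:
        return 0.0
    h = 0.0
    for count in doc_counts_per_category:
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h / math.log2(k)


def tfidf(tf, df, n_docs):
    """Plain tf * log(N / df) weight (one common tfidf variant)."""
    return tf * math.log(n_docs / df)


def tfidf_with_entropy(tf, df, n_docs, doc_counts_per_category):
    """Hypothetical entropy-adjusted weight.

    ASSUMPTION: the (1 - H) damping is an illustrative choice, not the
    formula from the paper.
    """
    h = entropy_factor(doc_counts_per_category)
    return tfidf(tf, df, n_docs) * (1.0 - h)


# A term found in 10 of 100 documents, all in one of three categories,
# keeps its full tfidf weight under this sketch.
concentrated = tfidf_with_entropy(3, 10, 100, [10, 0, 0])

# A term spread evenly over two categories is damped to zero.
spread = tfidf_with_entropy(3, 10, 100, [5, 5])
```

Under these assumptions, terms concentrated in a single category are treated as more discriminative than terms spread evenly over all categories, which is one plausible reading of why an entropy factor could help both term weighting and feature selection.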