Extending WHIRL with background knowledge for improved text classification

Authors:	William W. Cohen, Haym Hirsh, Sarah Zelikovitz
Publication year:	2006
Subject:	Information retrieval; business.industry; Computer science; Relational database; Process (engineering); Word processing; Semi-supervised learning; Library and Information Sciences; Machine learning; computer.software_genre; Set (abstract data type); Similarity (psychology); Pattern recognition (psychology); The Internet; Artificial intelligence; business; computer; Information Systems
Source:	Information Retrieval. 10:35-67
ISSN:	1573-7659 1386-4564
DOI:	10.1007/s10791-006-9004-6
Description:	Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular problem on the World Wide Web.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::1cbae3999e6ceac9fb9e81b6b6f7dd55 https://doi.org/10.1007/s10791-006-9004-6