A Focused Crawler by Segmentation of Context Information

Authors:	Jin Bum Kang, Joong Min Choi, Jae Young Yang, Nam Yong Lee, Chang Hee Cho
Publication year:	2005
Subjects:	Information retrieval; Computer science; Document classification; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Context (language use); Document clustering; Hyperlink; Focused crawler; computer.software_genre; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Relevance (information retrieval); tf–idf; Web crawler; computer
Source:	The KIPS Transactions: Part B, pp. 697-702
ISSN:	1598-284X
Description:	The focused crawler is a topic-driven document-collecting crawler that has been proposed as a promising alternative for maintaining up-to-date web document indices in search engines. A major problem inherent in previous focused crawlers is their liability to miss highly relevant documents that are linked from off-topic documents. This problem originates mainly from a lack of consideration of structural information within a document; traditional weighting methods such as TF-IDF, as employed in document classification, can lead to it. To improve the performance of focused crawlers, this paper proposes a scheme of locality-based document segmentation for determining the relevance of a document to a specific topic. We segment a document into a set of sub-documents using contextual features around the hyperlinks. This information is used to determine whether the crawler should fetch the documents that are linked from hyperlinks in an off-topic document.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::ee938251e9ae033b304ee7050587cf12 https://doi.org/10.3745/kipstb.2005.12b.6.697
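
Illustrative sketch (not part of the record): the description outlines segmenting a page into sub-documents around its hyperlinks and judging each link by its local context rather than by the whole page. The Python below is one minimal way to picture that idea. Everything specific in it is an assumption made for illustration and is not taken from the paper: the fixed word window (`window`), the keyword-ratio score in `relevance`, the `threshold` value, and all function and class names.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect the page's word tokens and the token position of each hyperlink."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # lowercase word tokens, in document order
        self.links = []   # (href, index of the first token after the <a> tag)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, len(self.tokens)))

    def handle_data(self, data):
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def segment_by_links(html, window=10):
    """Map each href to the tokens within `window` words of the link.

    The fixed-width window is a hypothetical locality rule, not the
    segmentation scheme of the paper.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    segments = {}
    for href, pos in parser.links:
        start = max(0, pos - window)
        segments[href] = parser.tokens[start:pos + window]
    return segments


def relevance(tokens, topic_terms):
    """Fraction of tokens that are topic terms (stand-in for a real classifier)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic_terms) / len(tokens)


def links_to_fetch(html, topic_terms, threshold=0.2, window=10):
    """Return links whose local context scores at or above the threshold."""
    return [
        href
        for href, tokens in segment_by_links(html, window).items()
        if relevance(tokens, topic_terms) >= threshold
    ]
```

With a page that is mostly about an unrelated subject but contains one paragraph and link about crawlers, the whole-page score can fall below the threshold while the segment around the crawler link stays above it, so `links_to_fetch` would still return that link. This is the situation the description identifies as a weakness of whole-document weighting.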