A Focused Crawler with Document Segmentation
Author: | Jinbeom Kang, Jaeyoung Yang, Joongmin Choi |
---|---|
Year: | 2005 |
Subject: | Anchor text, Information retrieval, Computer science, Document classification, Focused crawler, Hyperlink, Document clustering, Document processing, Search engine, Relevance (information retrieval), Web crawler, tf–idf |
Source: | Lecture Notes in Computer Science, ISBN 9783540269724; IDEAL |
DOI: | 10.1007/11508069_13 |
Description: | The focused crawler is a topic-driven document-collecting crawler that has been suggested as a promising alternative for maintaining up-to-date Web document indices in search engines. A major problem with previous focused crawlers is their tendency to miss highly relevant documents that are linked only from off-topic documents. This problem arises mainly because structural information within a document is not taken into account; traditional weighting methods such as TF-IDF, as employed in document classification, can lead to this problem. To improve the performance of focused crawlers, this paper proposes a locality-based document segmentation scheme for determining the relevance of a document to a specific topic. A document is segmented into a set of sub-documents using contextual features around its hyperlinks, and this information is used to decide whether the crawler should fetch the documents linked from hyperlinks in an otherwise off-topic document. |
Database: | OpenAIRE |
External link: |
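
The description above outlines the core idea: score the local context around each hyperlink rather than the page as a whole, so that relevant links on off-topic pages are not discarded. Below is a minimal sketch of that idea in Python, not the authors' implementation; the fixed character window, the plain term-frequency cosine score, and the relevance threshold are all illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch of link-context ("locality-based") scoring for a focused
# crawler. NOT the paper's method: window size, scoring, and threshold are
# hypothetical choices made only for demonstration.
import math
import re
from collections import Counter

ANCHOR_RE = re.compile(
    r'<a\s+[^>]*href="(?P<href>[^"]+)"[^>]*>(?P<text>.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text: str) -> Counter:
    """Lowercased word counts; HTML tags are stripped before counting."""
    plain = re.sub(r"<[^>]+>", " ", text).lower()
    return Counter(TOKEN_RE.findall(plain))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def segment_by_links(html: str, window: int = 200):
    """Yield (href, sub_document) pairs: each sub-document is the anchor
    plus a fixed-size window of surrounding page text (assumed segmentation)."""
    for m in ANCHOR_RE.finditer(html):
        start = max(0, m.start() - window)
        end = min(len(html), m.end() + window)
        yield m.group("href"), html[start:end]


def links_to_fetch(html: str, topic_terms: list[str], threshold: float = 0.05):
    """Score each link's local context against the topic and keep links whose
    context looks relevant, even if the page as a whole is off-topic."""
    topic_vec = Counter(t.lower() for t in topic_terms)
    return [
        href
        for href, sub_doc in segment_by_links(html)
        if cosine(tokenize(sub_doc), topic_vec) >= threshold
    ]


if __name__ == "__main__":
    page = """
    <html><body>
      <p>Our lab's news page covers many unrelated announcements.</p>
      <p>New results on <a href="/focused-crawler.html">focused crawling with
         document segmentation</a> improve topic-specific web crawler harvest rates.</p>
      <p>Also, the cafeteria <a href="/menu.html">menu</a> changed this week.</p>
    </body></html>
    """
    print(links_to_fetch(page, ["focused", "crawler", "crawling", "segmentation", "web"]))
```

In the paper's setting the segmentation and weighting would be more elaborate than a fixed character window with raw term frequencies; the sketch only shows where per-link context scoring would fit into a crawler's frontier-selection step.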