Automatic retrieval of similar content using search engine query interface

Authors: Chris Drome, Santanu Kolay, Paolo D'Alberto, Ali Dasdan
Year of publication: 2009
Subject:
Source: CIKM
DOI: 10.1145/1645953.1646043
Description: We consider the coverage testing problem: given a document and a corpus with a limited query interface, determine whether the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve it, we propose approaches that work in three main steps: generate a query signature from the document; query the corpus using the query signature and scrape the returned results; and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. Large-scale experimental validation shows that these methods perform well across different search engine corpora and on documents in multiple languages. They are also robust to variations in the performance parameters.
Database: OpenAIRE
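
The three steps named in the abstract (signature generation, querying, validation) can be illustrated with a minimal sketch. This is not the authors' method: the record does not say how signatures are built or how similarity is measured. The sketch assumes that the longest distinct non-stopword terms are a crude stand-in for rare terms, that an in-memory dictionary stands in for a search engine's query interface, and that Jaccard similarity over word shingles is an acceptable validation measure. All function names, the stopword list, and the default values of `k` and `threshold` are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_signature(document, k=4):
    """Step 1: pick k query terms. Longest distinct non-stopword terms are
    used here as a rough proxy for rare, discriminative terms."""
    terms = {t for t in tokenize(document) if t not in STOPWORDS}
    return sorted(terms, key=lambda t: (-len(t), t))[:k]


def search(corpus, signature, limit=10):
    """Step 2: toy stand-in for a search engine. Returns ids of documents
    that contain every signature term (a conjunctive query)."""
    hits = []
    for doc_id, text in corpus.items():
        tokens = set(tokenize(text))
        if all(term in tokens for term in signature):
            hits.append(doc_id)
    return hits[:limit]


def shingles(text, w=3):
    """Set of overlapping w-word shingles of the text."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_near_duplicates(document, corpus, k=4, threshold=0.6):
    """Step 3: validate each returned result against the input document."""
    signature = query_signature(document, k)
    target = shingles(document)
    matches = []
    for doc_id in search(corpus, signature):
        score = jaccard(target, shingles(corpus[doc_id]))
        if score >= threshold:
            matches.append((doc_id, score))
    return matches
```

Calling `find_near_duplicates(document, corpus)` with `corpus` as a dictionary mapping document ids to text returns the ids and similarity scores of candidates that pass the threshold. Against a real search engine, `search` would be replaced by calls to its query interface plus result scraping, and both `k` and `threshold` would need tuning.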