Automatic retrieval of similar content using search engine query interface
Authors: | Chris Drome, Santanu Kolay, Paolo D'Alberto, Ali Dasdan |
---|---|
Year of publication: | 2009 |
Subject: | Web search query; Information retrieval; Computer science; Keyword extraction; Query optimization; Query language; Ranking (information retrieval); Search engine; Query expansion; Web query classification; Computing Methodologies: Document and Text Processing; Sargable; RDF query language |
Source: | CIKM |
DOI: | 10.1145/1645953.1646043 |
Description: | We consider the coverage testing problem: given a document and a corpus accessible only through a limited query interface, determine whether the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve it, we propose approaches that work in three main steps: generate a query signature from the document; query the corpus with that signature and scrape the returned results; and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. Large-scale experimental validation shows that the methods perform well across different search engine corpora and on documents in multiple languages. They are also robust to variations in the performance parameters. |
Database: | OpenAIRE |
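
The three steps named in the abstract (signature generation, querying, validation) can be illustrated with a minimal Python sketch. This is not the authors' method: the record does not say how signatures are built or how similarity is measured, so the sketch assumes the longest distinct tokens as a signature and Jaccard similarity over word 3-shingles for validation. The names `query_signature`, `toy_search`, `has_near_duplicate`, the in-memory `CORPUS`, and the `0.8` threshold are all hypothetical; a real run would replace `toy_search` with calls to a search engine's query interface.

```python
import re
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def query_signature(document: str, k: int = 5) -> List[str]:
    """Step 1 (assumed heuristic): use the k longest distinct tokens as the query."""
    distinct = set(tokenize(document))
    return sorted(distinct, key=lambda t: (-len(t), t))[:k]


def shingles(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def has_near_duplicate(
    document: str,
    search: Callable[[List[str]], List[str]],
    k: int = 5,
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> bool:
    """Steps 2 and 3: query with the signature, then validate each result."""
    results = search(query_signature(document, k))
    doc_shingles = shingles(tokenize(document))
    return any(
        jaccard(doc_shingles, shingles(tokenize(r))) >= threshold for r in results
    )


# Hypothetical in-memory stand-in for a search engine with a limited interface.
CORPUS = [
    "the quick brown fox jumps over the lazy dog near the riverbank",
    "completely unrelated text about cooking pasta",
]


def toy_search(terms: List[str], limit: int = 10) -> List[str]:
    """Return corpus documents matching at least one term, best match first."""
    scored = [(len(set(terms) & set(tokenize(d))), d) for d in CORPUS]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: -pair[0])
    return [d for _, d in hits[:limit]]


if __name__ == "__main__":
    probe = "the quick brown fox jumps over the lazy dog near the riverbank today"
    print(has_near_duplicate(probe, toy_search))
```

The search function is passed in as a parameter so that the signature and validation logic stay independent of any particular engine; the signature size `k` and the similarity threshold are the kind of performance parameters the abstract refers to, though their roles here are assumed.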