Automated internal web page clustering for improved data extraction

Authors: George Pecherle, Cornelia A. Győrödi, Mihai Cornea, Robert Ş. Győrödi
Year of publication: 2012
Subject:
Source: WIMS
DOI: 10.1145/2254129.2254209
Description: In this paper, we present an algorithm that determines the repeating patterns inside the DOM tree of a web page. This lets us cluster the content inside a web page and obtain more relevant structured data. The resulting DOM structure can be used to mine other web pages that are similar in structure and one hop away from the initially targeted web page. The clusters are similar in structure, not in content, and our method is based on in-page clustering; this is what differentiates our algorithm from similar technologies that work on entire web pages.
Database: OpenAIRE
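
The record does not reproduce the paper's algorithm, so the following is only a minimal sketch of the general idea of in-page structural clustering, not the authors' method. It assumes that a "repeating pattern" can be approximated as a group of sibling subtrees sharing the same tag structure, with text content ignored. The names `Node`, `TreeBuilder`, `signature`, and `repeating_patterns`, and the `min_repeats` parameter, are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones DOM node: tag name, parent, and child elements only."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def signature(node):
    """Structural fingerprint of a subtree: tag names only, no content."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeating_patterns(html, min_repeats=2):
    """Return (parent_tag, signature, count) for sibling subtrees that repeat.

    Assumption: only subtrees with at least one child element are reported,
    so trivial repeats of empty leaf elements are skipped.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    clusters = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        groups = defaultdict(list)
        for child in parent.children:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            if len(members) >= min_repeats and members[0].children:
                clusters.append((parent.tag, sig, len(members)))
        stack.extend(parent.children)
    return clusters


if __name__ == "__main__":
    page = ("<ul><li><a>A</a><span>1</span></li>"
            "<li><a>B</a><span>2</span></li>"
            "<li><a>C</a><span>3</span></li></ul><p>x</p>")
    for parent_tag, sig, count in repeating_patterns(page):
        print(parent_tag, sig, count)
```

In this sketch the three list items form one cluster even though their text differs, which mirrors the description's point that clusters are similar in structure rather than in content. An exact-match signature is a deliberate simplification; real pages would likely need a tolerance for optional or missing child elements.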