Automated internal web page clustering for improved data extraction

Authors:	George Pecherle, Cornelia A. Győrödi, Mihai Cornea, Robert Ş. Győrödi
Publication year:	2012
Subject:	Information retrieval; Data extraction; Web mining; Computer science; Web page; Static web page; Document Object Model; Cluster analysis; Site map; Data Web
Source:	WIMS
DOI:	10.1145/2254129.2254209
Description:	This paper presents an algorithm that identifies repeating patterns in the DOM tree of a web page. These patterns make it possible to cluster the content within a page and obtain more relevant structured data. The resulting DOM structure can then be used to mine other web pages that are structurally similar and one hop away from the initially targeted page. The clusters group elements that are similar in structure, not in content, and the method clusters within a single page. This distinguishes the algorithm from similar techniques that operate on entire web pages.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::77c492bf2bed3f881651e91af65d0081 https://doi.org/10.1145/2254129.2254209
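
The abstract in this record describes finding repeating patterns inside a page's DOM tree and clustering by structure rather than by content. The record does not give the algorithm itself, so the following is only a minimal sketch of one possible reading of that idea, not the authors' method. It assumes that a "repeating pattern" means sibling elements whose subtrees have the same tag structure. The names `Node`, `TreeBuilder`, `signature`, and `find_clusters` are hypothetical, and the parsing relies only on Python's standard `html.parser`.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones DOM element: tag name, parent, child elements."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag.
        # Stray end tags with no matching open element are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of its children.

    Text content is deliberately excluded, so two elements match when they
    are similar in structure, regardless of what they contain.
    """
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_clusters(html, min_size=2):
    """Return (parent_tag, signature, count) for each group of siblings
    that share the same structural signature (assumed clustering rule)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    clusters = []
    stack = [builder.root]
    while stack:
        parent = stack.pop()
        groups = defaultdict(list)
        for child in parent.children:
            groups[signature(child)].append(child)
        for sig, members in groups.items():
            if len(members) >= min_size:
                clusters.append((parent.tag, sig, len(members)))
        stack.extend(parent.children)
    return clusters


page = "<ul><li><a>A</a></li><li><a>B</a></li><li><b>C</b></li></ul>"
print(find_clusters(page))  # [('ul', 'li(a())', 2)]
```

In this toy page, the two list items that wrap a link share a signature and form one cluster, while the third item has a different structure and is left out. Exact signature matching is a simplifying assumption; a more tolerant similarity measure would be needed to group near-identical subtrees.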