The Research of Automatic Extraction Dynamic Web Data

Author:	Qu Jubao
Year of publication:	2009
Subject:	Set (abstract data type); Template; Information engineering; Computer science; Web page; Dynamic web page; Data mining; Cluster analysis; Application software; computer.software_genre; computer; Field (computer science)
Source:	2009 International Forum on Information Technology and Applications.
DOI:	10.1109/ifita.2009.211
Description:	The rapid development of the World Wide Web has made it an increasingly important source of useful data. A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from databases. This paper proposes a novel approach that automatically detects the template behind a set of example pages and extracts the embedded data at the field level. The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy.
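The idea summarized in the description (infer a shared template from example pages, then read the data out of the slots the template leaves open) can be illustrated with a minimal sketch. This is not the paper's algorithm, which the record does not specify. It swaps in a simple token alignment based on Python's standard `difflib.SequenceMatcher`, and the function names `tokenize`, `detect_template`, and `extract_fields` are hypothetical. It assumes the pages differ only in their data values.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank ones."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def common_tokens(a, b):
    """Return the tokens that difflib aligns between sequences a and b, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def detect_template(pages):
    """Assumed heuristic: the template is what stays aligned across all pages."""
    template = tokenize(pages[0])
    for page in pages[1:]:
        template = common_tokens(template, tokenize(page))
    return template


def extract_fields(template, page):
    """Collect the page tokens that fall in the gaps between template tokens."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    fields = []
    pos = 0  # index of the first page token not yet accounted for
    for block in matcher.get_matching_blocks():
        if block.b > pos:
            fields.append(" ".join(tokens[pos:block.b]))
        pos = block.b + block.size
    return fields


if __name__ == "__main__":
    page_a = "<html><body><h1>Book A</h1><p>Price: <b>10</b></p></body></html>"
    page_b = "<html><body><h1>Book B</h1><p>Price: <b>25</b></p></body></html>"
    template = detect_template([page_a, page_b])
    print(extract_fields(template, page_a))
    print(extract_fields(template, page_b))
```

A pairwise alignment like this can misalign pages with many repeated tags or optional sections, so it only illustrates the detect-then-extract pipeline and is not a robust extractor.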
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::e475a1eb53598aa2a0df4c342e7c8466 https://doi.org/10.1109/ifita.2009.211