Author: |
Qu Jubao |
Year of publication: |
2009 |
Subject: |
|
Source: |
2009 International Forum on Information Technology and Applications. |
DOI: |
10.1109/ifita.2009.211 |
Description: |
The rapid development of the World Wide Web has made it an increasingly important source of useful data. A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from databases. This paper proposes a novel approach to automatically detecting templates from a set of example pages and extracting data at the field level. The objective is to automatically detect the template behind these pages and extract the embedded data. The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy. |
Database: |
OpenAIRE |
External link: |
|