Wrapper Maintenance for Web-Data Extraction Based on Pages Features.

Authors: Kacprzyk, Janusz, Kłopotek, Mieczysław A., Wierzchoń, Sławomir T., Trojanowski, Krzysztof, Shunxian Zhou, Yaping Lin, Jingpu Wang, Xiaolin Yang
Source: Intelligent Information Processing & Web Mining (9783540335207); 2006, p317-326, 10p
Abstract: Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. There are two main issues relevant to Web-data extraction, namely wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs wrappers correspondingly. Experiments over several real-world Web sites show that the proposed automatic approach can effectively maintain wrappers to extract desired data with high accuracy. [ABSTRACT FROM AUTHOR]
Database: Supplemental Index