Structured Object Matching across Web Page Revisions

Authors: Felix Naumann, Leon Bornemann, Dmitri V. Kalashnikov, Tobias Bleifuß, Divesh Srivastava
Year of publication: 2021
Subject:
Source: ICDE
DOI: 10.1109/icde51399.2021.00115
Description: A considerable amount of useful information on the web is (semi-)structured, such as tables and lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of these works focus only on the most recent snapshot of these web objects. However, their evolution over time represents valuable information that has barely been tapped, enabling various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions. In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. We are, thus, able to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.
Database: OpenAIRE