Popis: |
Several recent studies have concentrated on generating wrappers for extracting data from Web data sources. The ROADRUNNER system aims to automate the tedious and expensive process of writing wrappers in an unsupervised, domain-independent, and scalable manner. The system is based on a grammar inference algorithm, called MATCH, which has been designed within a sound theoretical framework. However, in its original definition MATCH lacks expressivity; that is, in many cases when MATCH runs over real-life Web pages, it is unable to produce a solution. In this paper we address the challenging issue of developing techniques that allow us to build an effective and efficient system upon MATCH without giving up its original formal background. First, we analyze the main limitations of MATCH; then we illustrate the techniques we have developed to overcome these limitations. Finally, we report the results of experiments that show the efficacy of the introduced techniques and demonstrate the improvements to the overall system. |