Popis: |
As there are more and more online sources available on the Web, it becomes very time-consuming, if not impossible, to visit and search all web sites, one by one. Many search engines has been developed to help users find information of their need. However, search engines work poor for online sources whose data are often in deep web, which is not part of surface web indexed by standard search engines. Metasearch is a very popular mechanism to search deep web. Metasearch provides the capability for users to search and access all of the information sources in one query submission. One of the fundamental problems in building metasearch systems is to learn wrappers which extract and integrate data records from query result pages returned from online sources. In this paper, develop an unsupervised approach for wrapper induction that combines visual, content and HTML tag information. Our approach first learns a visual content model that alleviates HTML tag differences among data records, and then finds a tag model from all data records that match the visual content model. Experiment shows that our approach works well for data sets collected from well-known search engines and shopping websites. |