Description: |
This thesis describes the design and implementation of an algorithm that, given a few initial hints from the user, converts data in HTML documents that were generated from a database and intended for human readers into a structured form suitable for machine processing. The input document is assumed to have some structure (usually a visual layout), and the user must provide a sample of semantically labelled items from the document. The output is expected to reflect the semantic structure of the provided data. The resulting application consists of an editor part, which includes a graphical tool for easily labelling sample items, and a server part, which includes a tool for the subsequent bulk processing of additional documents. The application was tested on real estate advertising websites, and the test results were analysed. The thesis also surveys existing applications based on similar principles and compares them. |