Popis: |
In this paper, we present an algorithm that detects repeating patterns in the DOM tree of a web page. Detecting these patterns lets us cluster the content within a page and extract more relevant structured data. The detected DOM structure can then be used to mine other web pages that are similar in structure and one hop away from the initially targeted page. The clusters group elements that are similar in structure, not in content, and our method is based on in-page clustering. This differentiates our algorithm from similar techniques that operate on entire web pages. |