Popis: |
When a search engine returns query results to users, it always returns the number of relevant documents (i.e., hits). The text line containing the hits number is called as hits line. The hits number can be used in several applications such as building meta-search engine, estimating the size and the relevance of search engines. Since the hits line is mixed with other text lines in the result page, it is difficult to automatically recognize and extract the line from the result page. To this end, decision tree techniques are employed together with a heuristic approach to build two filters to automatically identify the hits line. First, texts in result pages are automatically extracted in lines. Then four key features are identified and used to build a decision tree based on the learning sample search engines. Classification rules from the tree are built to serve as the first filter to recognize the extracted text lines. To reduce the mis-classification of the first filter, the second filter is constructed using a heuristic weighting approach. The experiment based on 100 search engines shows that the accuracy of 10-fold cross-validation is up to 95%. |