An n-Gram Based Approach to Multi-Labeled Web Page Genre Classification

Authors: Vlado Keselj, Jane E. Mason, Jack Duffy, Michael Shepherd, Carolyn Watters
Year of publication: 2010
Subject:
Source: HICSS
DOI: 10.1109/hicss.2010.58
Description: The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre, even when the Web page belongs to more than one genre. Experiments are run on a multi-labeled data set using both an SVM classifier and a distance function classification model. These n-gram based methods had very high precision results but somewhat lower recall results, indicating that the genre labels assigned by the classifiers are quite accurate, but that these machine learning classifiers are not assigning as many labels as did the human classifiers. The classification results compare favorably with those of other researchers on the same data set.
Database: OpenAIRE
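
Illustration: The description above mentions an n-gram representation of a Web page and a distance function classification model that can assign more than one genre. The record does not give the details, so the following is only a minimal sketch under stated assumptions: character n-grams, a relative-difference dissimilarity between frequency profiles, and a fixed distance threshold for multi-label assignment. The function names, the default values of `n` and `size`, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3, size=500):
    """Map the `size` most frequent character n-grams to normalized frequencies.

    Assumption: character n-grams and these defaults are illustrative only.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(size)}


def profile_distance(p1, p2):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this dissimilarity is one plausible choice; the record does not
    say which distance function the authors used. Identical profiles give 0.
    """
    d = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d


def assign_genres(page_text, genre_profiles, threshold, n=3, size=500):
    """Return every genre whose profile lies within `threshold` of the page.

    Assumption: thresholding is a hypothetical way to allow zero, one, or
    several labels per page.
    """
    page = ngram_profile(page_text, n, size)
    return sorted(
        genre
        for genre, profile in genre_profiles.items()
        if profile_distance(page, profile) <= threshold
    )
```

In this sketch a stricter (smaller) threshold assigns fewer labels per page, which is the kind of trade-off the description reports: labels that are assigned tend to be accurate, but fewer labels are assigned than by human classifiers.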