An n-Gram Based Approach to Multi-Labeled Web Page Genre Classification
Authors: | Vlado Keselj, Jane E. Mason, Jack Duffy, Michael Shepherd, Carolyn Watters |
---|---|
Year of publication: | 2010 |
Subject: | Computer science; Machine learning; Popularity; Data set; Support vector machine; Statistical classification; ComputingMethodologies_PATTERNRECOGNITION; n-gram; Web page; The Internet; Artificial intelligence; Representation (mathematics); Natural language processing |
Source: | HICSS |
DOI: | 10.1109/hicss.2010.58 |
Description: | The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to classify that Web page automatically by genre, even when the Web page belongs to more than one genre. Experiments are run on a multi-labeled data set using both an SVM classifier and a distance function classification model. These n-gram based methods achieved very high precision but somewhat lower recall, indicating that the genre labels assigned by the classifiers are quite accurate, but that the machine learning classifiers assign fewer labels than the human classifiers did. The classification results compare favorably with those of other researchers on the same data set. |
Database: | OpenAIRE |
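
The description above mentions an n-gram representation of a Web page and a distance function classification model that may assign more than one genre. The record does not give the details, so the following is only a minimal sketch under stated assumptions: character n-grams, a relative-difference dissimilarity in the style of the Common N-Grams method, and a hypothetical fixed threshold for deciding which genre labels to assign. The function names, the `threshold` value, and the profile size `top_k` are illustrative and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Map the top_k most frequent character n-grams to normalized frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.most_common(top_k)}


def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this CNG-style measure stands in for the unspecified
    distance function mentioned in the description.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def normalized_dissimilarity(p1, p2):
    """Scale the dissimilarity to [0, 1]; each n-gram contributes at most 4."""
    grams = set(p1) | set(p2)
    if not grams:
        return 0.0
    return dissimilarity(p1, p2) / (4 * len(grams))


def assign_genres(page_text, genre_profiles, threshold=0.5, n=3, top_k=500):
    """Return every genre whose profile is within the (hypothetical) threshold.

    A page may receive several labels, one label, or none at all.
    """
    page = ngram_profile(page_text, n, top_k)
    return sorted(
        genre
        for genre, profile in genre_profiles.items()
        if normalized_dissimilarity(page, profile) <= threshold
    )
```

In this sketch a genre profile would be built by calling `ngram_profile` on the concatenated text of the training pages for that genre. Because a label is assigned only when the dissimilarity falls below the threshold, a strict threshold tends to produce few but reliable labels, which is one plausible way a classifier could show the high-precision, lower-recall pattern the description reports.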