WikiAutoCat: Information Retrieval System for Automatic Categorization of Wikipedia Articles

Authors: Elsayed E. Hemayed, Nesma Refaei, Riham Mansour
Publication year: 2018
Subject:
Source: Arabian Journal for Science and Engineering. 43:8095-8109
ISSN: 2191-4281; 2193-567X
DOI: 10.1007/s13369-018-3244-9
Description: Document categorization has become a crucial task for organizing the massive amount of data on the web. Moreover, many web repositories classify their articles into hierarchies of topics. This structure makes it easier to connect related topics and to reach articles. Wikipedia organizes its articles in a category hierarchy, but so far the categorization has been done manually by human editors, which is a confusing, tiring, and time-consuming task. In this work, we propose the WikiAutoCat system for automatic categorization of Wikipedia articles. It is an information retrieval system that suggests the most relevant set of categories to the article editor to simplify the categorization process. Empirical evaluation demonstrates that our system is scalable enough to categorize such a large dataset and that it substantially improves on the state of the art in Wikipedia categorization, with accuracy gains of 41.65% over the WikiCat-Word system and 26.83% over the WikiCat-Link system. It was also evaluated on a benchmark dataset, where it achieved an accuracy gain of 8.1% over that dataset's baseline.
Database: OpenAIRE