A Sample Extension Method Based on Wikipedia and Its Application in Text Classification

Authors: Guannan Hu, Zhiguo Lu, Jianyue Ni, Wenhao Zhu, Yiting Liu
Publication year: 2018
Subject:
Source: Wireless Personal Communications. 102:3851-3867
ISSN: 1572-834X
0929-6212
Description: Text classification is a topic in natural language processing that is particularly useful for Internet information processing. Methods based on supervised learning require a large number of manually annotated training samples. Annotating training samples is time-consuming, and classifier performance relies heavily on the quality of those samples. This paper presents a text classification method based on sample extension. The extension is based on the correlation between the labeled sample data and the concepts in Wikipedia. Using the rich link relationships between concepts, we select appropriate articles from Wikipedia to expand the training sample set. By introducing the large number of semantically rich concept pages contained in Wikipedia, along with the links that connect those pages, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method performs better than both supervised and semi-supervised methods.
Database: OpenAIRE
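
Illustrative sketch: The description above outlines sample extension only at a high level, so the following Python is a toy illustration of the general idea and not the authors' algorithm. It assumes a bag-of-words cosine similarity as the "correlation" measure, a small in-memory dictionary standing in for Wikipedia, and a hypothetical rule that linked articles are accepted at half the seed threshold. The names `extend_samples`, `articles`, `links`, and `threshold` are all invented for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for any real text representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def extend_samples(labeled, articles, links, threshold=0.3):
    """Return sorted (title, label) pairs for articles to add to the training set.

    labeled:  list of (text, label) pairs
    articles: dict mapping article title -> article text (toy "Wikipedia")
    links:    dict mapping article title -> list of linked titles
    """
    added = {}  # title -> (label, best similarity score)
    for text, label in labeled:
        sample_vec = bag_of_words(text)
        # Step 1: seed articles are those that correlate with the labeled sample.
        seeds = [title for title, body in articles.items()
                 if cosine(sample_vec, bag_of_words(body)) >= threshold]
        # Step 2: follow links out of the seeds to collect further candidates.
        candidates = set(seeds)
        for seed in seeds:
            candidates.update(t for t in links.get(seed, []) if t in articles)
        # Step 3: keep candidates that pass the relaxed threshold (assumed rule),
        # assigning each article the label of its best-matching sample.
        for title in candidates:
            score = cosine(sample_vec, bag_of_words(articles[title]))
            if score >= threshold / 2 and score > added.get(title, (None, 0.0))[1]:
                added[title] = (label, score)
    return [(title, label) for title, (label, _) in sorted(added.items())]


# Toy data, invented for illustration only.
labeled = [("football match goal", "sports")]
articles = {
    "Football": "football is a sport where teams score a goal in a match",
    "Goalkeeper": "the goalkeeper guards the goal",
    "Opera": "opera is a form of theatre with music",
}
links = {"Football": ["Goalkeeper", "Opera"]}

extra = extend_samples(labeled, articles, links)
```

In this toy run, "Goalkeeper" is too dissimilar to be a seed on its own and enters the extended set only because "Football" links to it, while "Opera" is linked but rejected because it shares no vocabulary with the sample. The returned pairs would then be appended to the labeled set before training any classifier.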