A Sample Extension Method Based on Wikipedia and Its Application in Text Classification
Author: | Guannan Hu, Zhiguo Lu, Jianyue Ni, Wenhao Zhu, Yiting Liu |
---|---|
Publication year: | 2018 |
Subject: |
Computer science; business.industry; Supervised learning; Information processing; Sample (statistics); 02 engineering and technology; Semi-supervised learning; computer.software_genre; Computer Science Applications; Set (abstract data type); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Extension method; 020201 artificial intelligence & image processing; The Internet; Artificial intelligence; Electrical and Electronic Engineering; business; Classifier (UML); computer; Natural language processing |
Source: | Wireless Personal Communications. 102:3851-3867 |
ISSN: | 1572-834X 0929-6212 |
Description: | Text classification is a topic in natural language processing that is particularly useful for Internet information processing. Methods based on supervised learning require a large number of manually annotated training samples. Annotating training samples is time-consuming, and performance relies heavily on the quality of the training samples. This paper presents a text classification method based on sample extension. The extension is based on the correlation between the labeled sample data and the concepts in Wikipedia. Using the rich link relationships between concepts, we select appropriate articles from Wikipedia to expand the training sample set. By drawing on the large number of semantically rich concept pages in Wikipedia, together with the links between those pages, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method outperforms both supervised and semi-supervised methods. |
Database: | OpenAIRE |
External link: |