Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training.

Authors: Nasir IM (Department of Computer Science, HITEC University, Taxila 47080, Pakistan); Khan MA (Department of Computer Science, HITEC University, Taxila 47080, Pakistan); Yasmin M (Department of Computer Science, COMSATS University Islamabad, Wah Campus, Wah Cantonment 47040, Pakistan); Shah JH (Department of Computer Science, COMSATS University Islamabad, Wah Campus, Wah Cantonment 47040, Pakistan); Gabryel M (Department of Intelligent Computer Systems, Częstochowa University of Technology, 42-200 Częstochowa, Poland); Scherer R (Department of Intelligent Computer Systems, Częstochowa University of Technology, 42-200 Częstochowa, Poland); Damaševičius R (Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland).
Language: English
Source: Sensors (Basel, Switzerland) [Sensors (Basel)] 2020 Nov 27; Vol. 20 (23). Date of Electronic Publication: 2020 Nov 27.
DOI: 10.3390/s20236793
Abstract: Documents are stored in digital form across several organizations. Printing this volume of data and filing it in folders instead of storing it digitally is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image artifacts such as signatures, marks, logos, and handwritten notes. The major steps of the proposed technique are data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by a Pearson correlation coefficient-based technique that selects an optimized subset of features and removes redundant ones. The proposed technique is tested on the Tobacco3482 dataset, where it achieves a classification accuracy of 93.1% using a cubic support vector machine classifier, supporting the validity of the proposed technique.
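The fusion and Pearson-based selection steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything below is an assumption made for illustration: fusion is taken to be simple concatenation, a feature counts as redundant when its absolute Pearson correlation with an already-kept feature exceeds a threshold of 0.9, and the function names `fuse_features` and `select_by_pearson` are hypothetical. Random matrices stand in for the VGG19 and AlexNet features.

```python
import numpy as np


def fuse_features(a, b):
    """Fuse two per-sample feature matrices by column-wise concatenation.

    Assumption: the abstract says features are "fused" but not how;
    concatenation is one common choice.
    """
    return np.concatenate([a, b], axis=1)


def select_by_pearson(features, threshold=0.9):
    """Return indices of the columns kept after dropping redundant ones.

    Assumption: a column is redundant when its absolute Pearson
    correlation with an already-kept column exceeds `threshold`.
    """
    # Pairwise Pearson correlation between columns (features).
    corr = np.abs(np.corrcoef(features, rowvar=False))
    # Constant columns produce NaN; treat them as uncorrelated.
    corr = np.nan_to_num(corr)
    kept = []
    for j in range(features.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return np.array(kept)


# Stand-in data: 50 samples, 4 features per hypothetical extractor.
rng = np.random.default_rng(0)
vgg_like = rng.normal(size=(50, 4))
# The first "AlexNet-like" column is a linear copy of a "VGG-like" column,
# so it is perfectly correlated with it and should be dropped.
alex_like = np.hstack([vgg_like[:, :1] * 2.0 + 1.0, rng.normal(size=(50, 3))])

fused = fuse_features(vgg_like, alex_like)
kept = select_by_pearson(fused, threshold=0.9)
selected = fused[:, kept]
```

In this toy setup the duplicated column (index 4 of the fused matrix) is removed and the remaining columns are retained. The selected matrix could then be passed to a classifier such as a degree-3 polynomial-kernel SVM, which is one way to realize the "cubic support vector machine" the abstract mentions.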
Database: MEDLINE