Combining Multiple Views from a Distance Based Feature Extraction for Text Classification

Authors: Fabrício Olivetti de França, Charles Henrique Porto Ferreira, Debora Maria Rossi de Medeiros
Publication year: 2018
Subject:
Source: CEC
DOI: 10.1109/cec.2018.8477772
Description: Text Mining is a challenging task due to the lack of a naturally structured representation and the high dimensionality induced by commonly used feature extraction techniques. Different feature extraction techniques can produce multiple views that capture different aspects of the text documents being analyzed. Combining these features can lead to better accuracy in classification tasks, but also to an undesirable increase in the number of features. In this work, we investigate the use of a feature extraction technique called DCDistance as a multiple-view feature extractor for text documents, combined with a Genetic Algorithm-based feature selection, hereafter called MVDCD. The results show that the main advantage of MVDCD is that dimensionality is reduced by more than 90% while classification accuracy increases significantly when compared to vanilla DCDistance and other feature selection techniques. A side effect of using DCDistance and MVDCD is the possibility of model interpretability, as the extracted features are explicit.
Database: OpenAIRE