Combining Multiple Views from a Distance Based Feature Extraction for Text Classification
Authors: | Fabrício Olivetti de França, Charles Henrique Porto Ferreira, Debora Maria Rossi de Medeiros |
---|---|
Publication year: | 2018 |
Subject: | Computer science; Feature extraction; Networking & telecommunications; Feature selection; Pattern recognition; Engineering and technology; Text mining; Feature (computer vision); Genetic algorithm; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Artificial intelligence; Representation (mathematics); Curse of dimensionality; Interpretability |
Source: | CEC |
DOI: | 10.1109/cec.2018.8477772 |
Description: | Text mining is a challenging task because text lacks a naturally structured representation and because commonly used feature extraction techniques induce high dimensionality. Different feature extraction techniques can produce multiple views that capture different aspects of the text documents being analyzed. Combining these features can improve accuracy in classification tasks but also undesirably increases the number of features. In this work, we investigate using a feature extraction technique called DCDistance to extract multiple views of text documents, combined with a Genetic Algorithm based feature selection; we call the resulting method MVDCD. The results show that the main advantage of MVDCD is that dimensionality is reduced by more than 90% while classification accuracy increases significantly compared to vanilla DCDistance and other feature selection techniques. A side effect of using DCDistance and MVDCD is the possibility of model interpretability, as the extracted features are explicit. |
Database: | OpenAIRE |
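
The description names two ingredients: distance-based features extracted as multiple views, and a genetic algorithm that selects among them. The record does not define DCDistance or MVDCD, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each class is represented by the mean of its document vectors, that each "view" is a different distance metric to those representatives, and that fitness is leave-one-out 1-nearest-neighbour accuracy minus a small per-feature penalty. All function names, the toy data, and the GA settings are hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def class_representatives(docs, labels):
    # Assumption: a class is represented by the mean of its document vectors.
    reps = {}
    for c in sorted(set(labels)):
        members = [d for d, l in zip(docs, labels) if l == c]
        reps[c] = [sum(col) / len(members) for col in zip(*members)]
    return reps

def multi_view_features(docs, reps, metrics):
    # One feature per (metric, class) pair: the distance from the document
    # to that class representative. Each metric acts as one "view".
    return [[m(d, reps[c]) for m in metrics for c in reps] for d in docs]

def loo_1nn_accuracy(features, labels, mask):
    # Leave-one-out 1-NN accuracy using only the features kept by the mask.
    idx = [i for i, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    hits = 0
    for i, fi in enumerate(features):
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: sum((fi[k] - features[j][k]) ** 2 for k in idx),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(features)

def fitness(features, labels, mask, penalty=0.01):
    # Reward accuracy, lightly penalise every selected feature.
    return loo_1nn_accuracy(features, labels, mask) - penalty * sum(mask)

def ga_select(features, labels, pop_size=20, generations=30, p_mut=0.1, seed=0):
    # Plain generational GA over bit masks: elitism, tournament selection,
    # uniform crossover, bit-flip mutation.
    rng = random.Random(seed)
    n = len(features[0])
    score = lambda m: fitness(features, labels, m)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(pop, key=score)]  # keep the best mask unchanged
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 3), key=score)
            p2 = max(rng.sample(pop, 3), key=score)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=score)

# Toy term-count vectors (hypothetical data, four terms per document).
docs = [[3, 0, 1, 0], [2, 1, 0, 0], [4, 0, 0, 1],
        [0, 3, 0, 2], [0, 2, 1, 3], [1, 4, 0, 2]]
labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

reps = class_representatives(docs, labels)
features = multi_view_features(docs, reps, [euclidean, manhattan, cosine_distance])
mask = ga_select(features, labels)
print("selected features:", mask)
```

Because every feature is a distance to a named class under a named metric, the selected mask can be read back directly as "which distances to which classes matter", which is one way to understand the interpretability remark in the description. In a real experiment the class representatives would be computed from training documents only.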