FCFilter: Feature selection based on clustering and genetic algorithms
Authors: | Fabiana Soares Santana, Debora Maria Rossi de Medeiros, Charles Henrique Porto Ferreira |
---|---|
Year of publication: | 2016 |
Subject: | business.industry; Computer science; 05 social sciences; Feature selection; Pattern recognition; 02 engineering and technology; computer.software_genre; Data modeling; Support vector machine; Set (abstract data type); ComputingMethodologies_PATTERNRECOGNITION; Text mining; Feature (computer vision); 0502 economics and business; 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Data mining; Artificial intelligence; Cluster analysis; business; computer; 050203 business & management; Selection (genetic algorithm); Curse of dimensionality |
Source: | CEC |
DOI: | 10.1109/cec.2016.7744048 |
Description: | Text mining, the search for patterns in large amounts of textual data, can be both rewarding and challenging. The patterns can reveal tendencies, similarities and predictions, but the information is usually implicit and difficult to validate. Classification is one of the most relevant research areas in text mining; it usually consists of predicting the class of a textual document, such as its author or topic, from a set of documents previously organized into classes. Choosing the words that compose the feature set is crucial to proper classification. A well-selected feature set can improve the performance of the classification method and make the classification model fitted to the data easier to interpret. This paper introduces the Feature Cluster Filter (FCFilter) method for feature selection. The method clusters the features and keeps only the clusters whose features are good predictors for text classification. FCFilter eliminates the need to input or optimize the number of clusters by grouping the words into a sufficiently high number of clusters. Genetic algorithms then optimize the combination of clusters that provides the final feature set. Experiments evaluating FCFilter on the Reuters-21578, SCY-Genes and SCY-Clusters datasets showed a significant reduction in the dimensionality of the feature-value table, with slight improvements in classification accuracy compared to the baselines. The results are promising and indicate potential improvements in research on feature selection for text mining. |
Database: | OpenAIRE |
External link: |
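
The description outlines a two-stage idea: group the word features into many clusters, then let a genetic algorithm pick which clusters to keep. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The record does not say which clustering algorithm, fitness function or genetic operators FCFilter uses, so the plain k-means, the nearest-centroid accuracy with a size penalty, the one-point crossover, the function names (`cluster_features`, `make_fitness`, `select_clusters`) and the toy data are all hypothetical choices made for illustration.

```python
import random

def cluster_features(columns, k, iters=20, seed=0):
    """Group features (columns of the document-term table) with plain k-means.
    Assumption: the record does not name the clustering algorithm."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(columns, k)]
    labels = [0] * len(columns)
    for _ in range(iters):
        # Assign every feature to its nearest cluster center.
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(col, centers[j])))
                  for col in columns]
        # Move each center to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [col for col, lab in zip(columns, labels) if lab == j]
            if members:
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def make_fitness(X, y, penalty=0.01):
    """Hypothetical fitness: training accuracy of a nearest-centroid
    classifier on the chosen features, minus a small cost per feature."""
    classes = sorted(set(y))
    def fitness(feats):
        if not feats:
            return 0.0
        cent = {c: [sum(row[f] for row, lab in zip(X, y) if lab == c) / y.count(c)
                    for f in feats] for c in classes}
        correct = 0
        for row, lab in zip(X, y):
            pred = min(classes,
                       key=lambda c: sum((row[f] - m) ** 2 for f, m in zip(feats, cent[c])))
            correct += pred == lab
        return correct / len(X) - penalty * len(feats)
    return fitness

def select_clusters(labels, k, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Genetic search over bit masks; bit j decides whether cluster j is kept."""
    rng = random.Random(seed)
    def features_of(mask):
        return [i for i, lab in enumerate(labels) if mask[lab]]
    def score(mask):
        return fitness(features_of(mask))
    # Start from the "keep every cluster" baseline plus random masks.
    pop = [[True] * k] + [[rng.random() < 0.5 for _ in range(k)]
                          for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]          # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k)           # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    return features_of(max(pop, key=score))

# Toy document-term table: features 0 and 1 track the class, 2..5 are noise.
X = [[3, 2, 1, 0, 1, 1], [2, 3, 0, 1, 1, 0], [3, 3, 1, 1, 0, 1], [2, 2, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1], [0, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1], [0, 0, 0, 0, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
K = 4  # deliberately generous number of clusters for six features

labels = cluster_features([list(col) for col in zip(*X)], K)
fitness = make_fitness(X, y)
selected = select_clusters(labels, K, fitness)
print("cluster of each feature:", labels)
print("selected features:", selected)
```

Because the search is seeded with the mask that keeps every cluster and the best half of each generation survives, the returned feature set never scores below the all-features baseline under this toy fitness. A real evaluation would measure accuracy on held-out documents rather than on the training data.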