Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem
Authors: | Roni Stern, Nir Ofek, Lior Rokach, Asaf Shabtai |
---|---|
Year of publication: | 2017 |
Subject: | Industrial biotechnology; Computer science; Cognitive neuroscience; Sentiment analysis; Pareto principle; Engineering and technology; Intrusion detection system; Machine learning; Computer science applications; Statistical classification; ComputingMethodologies_PATTERNRECOGNITION; Industrial engineering & automation; Artificial intelligence; Undersampling; Electrical engineering, electronic engineering, information engineering; Artificial intelligence & image processing; Data mining; Cluster analysis; Time complexity; Classifier (UML) |
Source: | Neurocomputing. 243:88-102 |
ISSN: | 0925-2312 |
Description: | Datasets with imbalanced class distributions pose a challenge for learning and classification algorithms. Imbalanced datasets exist in many domains, such as fraud detection, sentiment analysis, churn prediction, and intrusion detection in computer networks. Three main approaches are typically used to solve the imbalance problem: data resampling, method adaptation, and cost-sensitive learning. Of these, data resampling, which either oversamples the minority class instances or undersamples the majority class instances, is the most widely used. In most cases, however, implementing these approaches involves a trade-off between predictive performance and complexity. In this paper we introduce a fast, novel clustering-based undersampling technique for addressing binary-class imbalance problems. It demonstrates high predictive performance, while its time complexity is bounded by the number of minority class instances. During the training phase, the algorithm clusters the minority instances and selects a similar number of majority instances from each cluster. A specific classifier is then trained for each cluster. An unlabeled instance is classified as the majority class if it does not fit into any of the clusters. Otherwise, the cluster-specific classifiers are used to return the instance's classification, and their results are weighted by the inverse distance from the clusters. Our evaluation includes several state-of-the-art methods. We plot the Pareto frontier for various datasets to consider both computational cost and predictive performance measures. Extensive sets of experiments demonstrate that only the suggested method is always found on the frontier. |
Database: | OpenAIRE |
External link: |
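
The training and classification steps summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes no attempt to reproduce the stated complexity bound. The record does not specify the details, so the following are assumptions made only for illustration: k-means as the clustering method, the cluster radius (largest distance from a member to the centroid) as the test for whether an instance "fits" a cluster, choosing the majority instances nearest to each centroid, and a nearest-class-mean rule as the cluster-specific classifier. All function names are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def assign(points, centers):
    """Group points by their nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups[nearest].append(p)
    return groups

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering method, not taken from the record)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = assign(points, centers)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, assign(points, centers)

def train(minority, majority, k=2):
    """Cluster the minority class, then build one small model per cluster."""
    centers, groups = kmeans(minority, k)
    models = []
    for center, group in zip(centers, groups):
        if not group:
            continue
        # Assumption: pick as many majority instances as the cluster has
        # minority instances, preferring those nearest to the centroid.
        picked = sorted(majority, key=lambda p: dist(p, center))[:len(group)]
        models.append({
            "center": center,
            "radius": max(dist(p, center) for p in group),
            "pos": mean(group),   # minority class mean within the cluster
            "neg": mean(picked),  # mean of the selected majority instances
        })
    return models

def predict(models, x):
    """Return 1 for the minority class and 0 for the majority class."""
    fitting = [m for m in models if dist(x, m["center"]) <= m["radius"]]
    if not fitting:
        return 0  # fits no cluster: classified as majority
    score = total = 0.0
    for m in fitting:
        weight = 1.0 / (dist(x, m["center"]) + 1e-9)  # inverse-distance weight
        vote = 1 if dist(x, m["pos"]) < dist(x, m["neg"]) else 0
        score += weight * vote
        total += weight
    return 1 if score / total >= 0.5 else 0

# Toy data: two tight minority clusters and scattered majority instances.
minority = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
majority = [(5, 5), (5, 6), (6, 5), (3, 3), (8, 8), (4, 7), (7, 4),
            (20, 20), (-5, -5), (15, 2), (2, 15), (6, 6)]
models = train(minority, majority, k=2)
print(predict(models, (0.5, 0.5)), predict(models, (5, 5)))
```

In this sketch an instance that falls outside every cluster radius is labeled as majority without consulting any classifier, which mirrors the rule stated in the description. The radius test and the nearest-mean vote are placeholders that could be swapped for any other membership test or per-cluster classifier.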