Abstract: |
In classification, class imbalance is a common problem. This phenomenon, which is growing in significance, arises in most real-world domains and occurs when data samples are distributed unevenly among classes: most of the data belong to the larger class, and far fewer belong to the smaller class. Because standard classifiers do not account for the imbalanced class distribution, they perform poorly on such data. Many techniques have been proposed to address class imbalance. Among them, preprocessing techniques try to balance the training set. These methods balance the class distribution either by removing redundant samples from the larger class or by creating new samples for the smaller one. The former is known as under-sampling and the latter as over-sampling. In this paper, we propose a score-based preprocessing technique that combines under-sampling and over-sampling to overcome the weakness of classifiers on class imbalance problems. To this end, we apply the sharing strategy in both stages to identify more suitable samples based on their importance in the feature space. In the over-sampling stage, synthetic samples of the smaller class are generated by interpolating between sparser samples. Then, in the under-sampling stage, denser samples of the larger class are selected for removal. We use the binary tournament selection operator in both stages so that over-sampling and under-sampling are performed probabilistically. In the experiments, a support vector machine (SVM) is used to train a classification model on the balanced training sets produced by the different preprocessing methods. F-measure and AUC are used as evaluation measures. Finally, we compare all methods in terms of the complexity of the resulting classification model. 
Results on 44 standard imbalanced datasets show that the proposed method is superior to and more effective than the other methods. [ABSTRACT FROM AUTHOR] |
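
The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact score, so the code assumes a triangular sharing function with a hypothetical radius `sigma` (as in fitness sharing from evolutionary computation), computes the scores once per class, and uses hypothetical names (`niche_counts`, `binary_tournament`, `rebalance`, `n_new`, `n_remove`). A low niche count is read as "sparse" and a high one as "dense".

```python
import numpy as np


def niche_counts(X, sigma):
    """Sharing-based density score (assumed form): for each sample, sum a
    triangular sharing value 1 - d/sigma over all samples closer than sigma.
    The sample itself contributes 1, so an isolated sample scores 1.0."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.where(d < sigma, 1.0 - d / sigma, 0.0).sum(axis=1)


def binary_tournament(candidates, scores, rng, prefer_low):
    """Draw two distinct candidate indices at random and return the one with
    the lower score (prefer_low=True) or the higher score (prefer_low=False)."""
    i, j = rng.choice(candidates, size=2, replace=False)
    if prefer_low:
        return i if scores[i] <= scores[j] else j
    return i if scores[i] >= scores[j] else j


def rebalance(X_min, X_maj, n_new, n_remove, sigma=1.0, seed=0):
    """Over-sample the smaller class, then under-sample the larger class."""
    if len(X_min) < 2 or n_remove > len(X_maj) - 1:
        raise ValueError("need >= 2 minority samples and n_remove < len(X_maj)")
    rng = np.random.default_rng(seed)

    # Stage 1: over-sampling. Each synthetic sample is a random point on the
    # segment between two tournament winners that favour sparse samples.
    s_min = niche_counts(X_min, sigma)
    all_min = np.arange(len(X_min))
    synthetic = []
    for _ in range(n_new):
        a = binary_tournament(all_min, s_min, rng, prefer_low=True)
        b = binary_tournament(all_min, s_min, rng, prefer_low=True)
        t = rng.random()
        synthetic.append(X_min[a] + t * (X_min[b] - X_min[a]))
    X_min_out = np.vstack([X_min, np.array(synthetic)]) if synthetic else X_min

    # Stage 2: under-sampling. Each tournament picks the denser of two
    # remaining samples of the larger class and removes it.
    s_maj = niche_counts(X_maj, sigma)
    keep = np.ones(len(X_maj), dtype=bool)
    for _ in range(n_remove):
        remaining = np.flatnonzero(keep)
        loser = binary_tournament(remaining, s_maj, rng, prefer_low=False)
        keep[loser] = False
    return X_min_out, X_maj[keep]
```

Calling `rebalance(X_min, X_maj, n_new=10, n_remove=20)` on a 5-sample smaller class and a 50-sample larger class would return arrays with 15 and 30 rows respectively. Because selection is by tournament rather than by strict ranking, sparse and dense samples are only favoured with higher probability, which matches the probabilistic behaviour the abstract attributes to the operator.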