Description: |
In a class-imbalanced data set, one class contains more instances than the other, which is a critical problem in data mining. Many approaches, such as oversampling, undersampling, and cost-sensitive methods, have been developed to mitigate the effects of class imbalance, but these methods suffer from various shortcomings. Existing work has rarely applied normalization to imbalanced data sets to mitigate these effects. In this work, we implemented two state-of-the-art data balancing methods, Random Undersampling (RUS) and Random Oversampling (ROS), each ensembled with the AdaBoost algorithm. We then compared these two methods with a recently developed approach, the Random Splitting data balancing (SplitBal) method, both with and without normalization of the imbalanced data set. For normalization, we used three well-known techniques: min-max, z-score, and robust scaling. SplitBal, the approach of interest here, is an ensemble method that first converts the imbalanced data set into several balanced data sets. Multiple classification models are then built from the balanced data sets and combined using the max ensemble rule. An empirical analysis on fifteen imbalanced data sets shows that, with the Random Forest classifier, SplitBal with min-max normalization outperforms the other data balancing methods considered in this work. |
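
The three normalization techniques and the SplitBal steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All function names are hypothetical, and several details are assumptions that the abstract does not specify: the majority class is split into bins of roughly minority-class size, z-score uses the population standard deviation, robust scaling uses inclusive quartiles, and the max rule takes, for each class, the highest probability reported by any model.

```python
import random
import statistics


def min_max(values):
    """Rescale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def z_score(values):
    """Center on the mean and divide by the standard deviation.

    Assumption: population standard deviation (pstdev).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range.

    Assumption: quartiles computed with the "inclusive" method.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr if iqr else 0.0 for v in values]


def split_balanced(majority, minority, seed=0):
    """Pair each bin of the shuffled majority class with the full minority class.

    Assumption: the number of bins is the rounded imbalance ratio, so each
    bin is roughly the size of the minority class.
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    n_bins = max(1, round(len(majority) / len(minority)))
    bins = [shuffled[i::n_bins] for i in range(n_bins)]
    return [(bin_, minority) for bin_ in bins]


def max_rule(probabilities):
    """Combine per-model class probabilities for one instance.

    For each class, keep the highest probability any model assigned to it,
    then predict the class whose kept value is largest.
    """
    n_classes = len(probabilities[0])
    combined = [max(p[c] for p in probabilities) for c in range(n_classes)]
    return combined.index(max(combined))
```

In this sketch, a feature column would be normalized first, `split_balanced` would then produce one balanced training set per bin, a classifier would be trained on each, and `max_rule` would merge the per-model probabilities for every test instance.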