A novel data balancing technique via resampling majority and minority classes toward effective classification.

Authors: Hasan, Mahmudul; Rabbi, Md. Fazle; Sultan, Md. Nahid; Nitu, Adiba Mahjabin; Uddin, Md. Palash
Subject:
Source: Telkomnika; Dec 2023, Vol. 21, Issue 6, pp. 1308-1316, 9 pp.
Abstract: Classification is a predictive modelling task in machine learning (ML) in which a class label is predicted for an example described by predefined features. In tasks such as recognizing handwritten characters, identifying spam, detecting disease, and identifying signals, classification requires training data with many features and labelled instances. In medical informatics, high precision and recall are required in addition to high accuracy of the ML classifiers. Most real-life datasets are imbalanced, which hampers the overall performance of the classifiers. Existing data balancing techniques operate on the whole dataset at once, which sometimes causes overfitting and underfitting. We propose a data balancing technique that follows a divide-and-conquer procedure: the dataset is clustered into several segments, and both oversampling and undersampling are performed on each cluster. Finally, the clusters are joined together to build a balanced dataset. We chose sample data from two heart disease datasets: Hungarian and Long Beach. Logistic regression and random forest classifiers serve as the representative ML algorithms. We compare our proposed technique with the existing SMOTE, NearMiss, and SMOTETomek data balancing techniques. Both algorithms perform better on datasets balanced with the proposed technique. This technique can be an optimal solution for handling imbalanced data. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
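
Note: The cluster-then-resample idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or which resampling methods the authors use, so this sketch makes its own assumptions: cluster assignments are supplied by the caller (for example from k-means), oversampling is plain random duplication, and undersampling is plain random removal. The function name `balance_by_cluster` and the per-cluster target size are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def balance_by_cluster(X, y, cluster_ids, seed=0):
    """Balance classes inside each cluster, then join the clusters.

    Assumption: cluster_ids comes from any clustering step chosen by the
    caller; random over/undersampling stands in for the paper's methods.
    """
    rng = random.Random(seed)

    # Group row indices by cluster, then by class label.
    clusters = defaultdict(lambda: defaultdict(list))
    for i, (label, c) in enumerate(zip(y, cluster_ids)):
        clusters[c][label].append(i)

    keep = []
    for c in sorted(clusters):
        by_class = clusters[c]
        total = sum(len(idx) for idx in by_class.values())
        # Hypothetical target: equal share of the cluster for each class.
        target = max(1, total // len(by_class))
        for label in sorted(by_class):
            idx = by_class[label]
            if len(idx) > target:
                # Undersample: keep a random subset of the larger class.
                keep.extend(rng.sample(idx, target))
            else:
                # Oversample: keep all rows, then duplicate random ones.
                keep.extend(idx)
                keep.extend(rng.choices(idx, k=target - len(idx)))

    # Join the balanced clusters into one dataset.
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: two clusters of six rows each, both dominated by class 0.
X = [[float(i)] for i in range(12)]
y = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
cluster_ids = [0] * 6 + [1] * 6
Xb, yb = balance_by_cluster(X, y, cluster_ids)
```

Because resampling happens per cluster, a cluster containing a single class is left unchanged in this sketch; a fuller implementation would need a policy for that case.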