The Effect of Rebalancing Techniques on the Classification Performance in Cyberbullying Datasets

Authors: Marwa Khairy, Tarek M. Mahmoud, Tarek Abd El-Hafeez
Year of publication: 2022
DOI: 10.21203/rs.3.rs-1730456/v1
Description: Machine learning plays an increasingly significant role in building cyberbullying detection systems. In cyberbullying datasets, samples labeled normal outnumber samples labeled abnormal, a situation known in data mining as the class imbalance problem. Class imbalance is a challenging problem in classification, especially in two-class datasets. When class distributions are imbalanced, conventional machine learning methods perform worse on unseen samples of the minority class, because the model is heavily influenced by the majority class. Many over-sampling and under-sampling techniques have been proposed in the literature to address this problem. This paper examines the effect of over-sampling and under-sampling techniques on cyberbullying datasets. In the experimental study, we first perform a preprocessing step to improve the performance of the machine learning algorithms. We then examine the effect of imbalanced data on classification performance for four cyberbullying datasets. To study classification performance on balanced data, we rebalance these datasets with four resampling techniques: random under-sampling, random over-sampling, SMOTE, and SMOTE + Tomek. We examine the impact of each rebalancing technique on classification performance using eight well-known classification algorithms. Our experiments show that the performance of a resampling technique depends on the dataset size, the imbalance ratio, and the classifier used. No single technique consistently outperformed the others.
Database: OpenAIRE
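
Illustrative note: the two simplest rebalancing techniques named in the abstract, random over-sampling and random under-sampling, can be sketched in a few lines. This is a minimal sketch using only the Python standard library, not the authors' implementation; the function names, the toy messages, and the labels "normal" and "bullying" are assumptions made for illustration. SMOTE and SMOTE + Tomek are not shown here; implementations are available in the imbalanced-learn package.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of larger classes so that every
    class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        out_samples.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data: 8 "normal" messages, 2 "bullying" messages.
texts = [f"message {i}" for i in range(10)]
classes = ["normal"] * 8 + ["bullying"] * 2

_, over_labels = random_oversample(texts, classes)
_, under_labels = random_undersample(texts, classes)
print(Counter(over_labels))
print(Counter(under_labels))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.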