Description: |
Studies on automatically predicting student learning outcomes often focus on developing and optimizing machine learning algorithms that fit the data captured from different education systems. This approach has a serious weakness when applied to disadvantaged groups, such as students with academic warnings or those who have dropped out, because these groups are usually much smaller than the other groups. Such imbalanced data, with a skewed class distribution, make it difficult to train good classification models. A prominent approach to this challenge is to apply oversampling methods that increase the number of minority-class samples; however, generating good new samples from the existing instances of a minority class remains difficult and requires further investigation. This study presents two new methods for handling data imbalance, Improved SMOTE (I_SMOTE) and Improved ADASYN (I_ADASYN), based on the original SMOTE and adaptive synthetic sampling (ADASYN) algorithms. The modifications introduce a new method for selecting suitable candidates, based on a new similarity measure and roulette wheel selection, to generate synthetic data samples. The aim is to rebalance the data and thereby improve prediction accuracy for minority groups. The proposed methods were designed for and applied to education datasets, and they were evaluated on public datasets and on a dataset collected from a Vietnamese university. The experimental results on learning datasets showed the high potential of the novel algorithms, I_SMOTE and I_ADASYN, for student academic performance prediction in general and for at-risk student groups in particular. Empirical results showed that the minority-class recall, precision, and F1-score of I_SMOTE and I_ADASYN are substantially better than those of the original balancing algorithms. In addition, I_SMOTE and I_ADASYN achieve relative improvements of 6.6% and 8.0% in ROC area over the original SMOTE and ADASYN, respectively. |