Enhancing Prediction Accuracy in an Imbalanced Dataset of Dengue Infection Cases Using a Two-layer Ensemble Outlier Detection and Feature Selection Technique.

Author: Fahmi, Amiq, Purwitasari, Diana, Sumpeno, Surya, Purnomo, Mauridhi Hery
Subject:
Source: International Journal of Intelligent Engineering & Systems; 2024, Vol. 17 Issue 2, p544-560, 17p
Abstract: Real-world datasets frequently contain considerable noise, which gives rise to outliers. Detecting and removing outliers in large, imbalanced datasets is a challenging problem in machine learning, especially in healthcare, where accurate prediction matters. Handling outliers properly is therefore essential, as their presence in classification datasets makes predictive modelling more difficult and less accurate. This study proposes methods to enhance prediction accuracy in an imbalanced real-world health dataset of dengue infection cases. First, a two-layer ensemble method called IsFLOF, which combines an isolation forest (IsF) and a local outlier factor (LOF), is used to find and eliminate global and local outliers. This approach overcomes the limitations of the IsF algorithm, which is sensitive to global outliers but tends to miss local ones, while LOF excels at local outlier detection but has high computational complexity. Second, once outliers had been eliminated and a dataset with correctly measured value distributions obtained, a resampling process was applied to prevent prediction bias caused by class imbalance in the multi-class setting. Subsequently, insignificant features were filtered out to further refine the dataset. Finally, eight machine learning algorithms were used to test the robustness and effectiveness of the proposed method. The experimental results showed that the AdaBoost classifier, combined with features selected by the Fast Correlation-Based Filter (FCBF), achieved 93.5% and 95.1% accuracy in training and testing, respectively. In a broader context, the proposed method was also tested and compared with recent methods, including on a public dataset of imbalanced hypothyroid cases. It showed higher and more acceptable prediction accuracy than was obtained with the original and synthetic data. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
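
Illustrative sketch: the pipeline outlined in the abstract (two-layer outlier removal, resampling, feature filtering, AdaBoost) might look roughly like the Python below, using scikit-learn. This is not the authors' implementation. Several points are assumptions made only for illustration: the abstract does not say how the two layers are combined, so the sketch applies the isolation forest first and LOF on the survivors; the resampling step is plain random oversampling; a mutual-information filter stands in for FCBF, which scikit-learn does not provide; and the data are synthetic, with all function names and parameter values hypothetical.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, IsolationForest
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor


def two_layer_outlier_filter(X, y, contamination=0.05, n_neighbors=20, seed=0):
    """Drop rows flagged by an isolation forest, then rows flagged by LOF.

    The sequential order is an assumption; the abstract only names the two detectors.
    """
    # Layer 1: isolation forest, aimed at global outliers (+1 inlier, -1 outlier).
    isf = IsolationForest(contamination=contamination, random_state=seed)
    keep_global = isf.fit_predict(X) == 1
    X1, y1 = X[keep_global], y[keep_global]

    # Layer 2: LOF on the surviving rows, aimed at local outliers.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    keep_local = lof.fit_predict(X1) == 1
    return X1[keep_local], y1[keep_local]


def oversample_to_majority(X, y, seed=0):
    """Random oversampling (assumed resampler): duplicate minority rows until
    every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [np.arange(len(y))]
    for c, n in zip(classes, counts):
        if n < target:
            idx.append(rng.choice(np.flatnonzero(y == c), size=target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]


# Synthetic, imbalanced three-class data standing in for the dengue dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

X_clean, y_clean = two_layer_outlier_filter(X, y)

# Split before resampling so duplicated rows cannot leak into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_clean, y_clean, test_size=0.3, stratify=y_clean, random_state=0
)
X_bal, y_bal = oversample_to_majority(X_tr, y_tr)

# Mutual-information filter as a stand-in for FCBF (k=5 is arbitrary).
selector = SelectKBest(partial(mutual_info_classif, random_state=0), k=5)
X_bal_sel = selector.fit_transform(X_bal, y_bal)
X_te_sel = selector.transform(X_te)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_bal_sel, y_bal)
accuracy = clf.score(X_te_sel, y_te)
print(f"rows kept: {len(X_clean)}/{len(X)}, test accuracy: {accuracy:.3f}")
```

The accuracy printed on this synthetic data says nothing about the figures reported in the abstract; the sketch only shows how the stages could be chained.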