Optimizing machine learning models for classification of stroke patients with epileptiform EEG pattern: the impact of dataset balancing techniques.

Authors: Iscra, Katerina; Biscontin, Alessandro; Miladinovic, Aleksandar; Bonini, Andrea; Furlanis, Giovanni; Prandin, Gabriele; Malesani, Michele; Naccarato, Marcello; Manganotti, Paolo; Accardo, Agostino; Ajčević, Miloš
Subject:
Source: Procedia Computer Science; 2024, Vol. 246, p4600-4609, 10p
Abstract: Epileptiform electroencephalogram (EEG) patterns are commonly observed in stroke patients and can significantly affect clinical management and patient outcomes. Classifying stroke patients to identify those with a high probability of epileptiform EEG patterns may therefore improve stroke management. Interest in and use of machine learning, especially for classification tasks, has increased notably in recent years. However, imbalanced datasets pose a problem for machine learning algorithms: predictions become skewed toward the dominant classes and accuracy drops, especially for underrepresented classes. This study therefore evaluates the effect of dataset balancing methods on the performance of machine learning models that classify stroke patients with epileptiform EEG patterns, comparing models trained on imbalanced and balanced datasets. Four sampling techniques were employed: an oversampling technique, SMOTENC; an undersampling technique, NearMiss; and two techniques that combine over- and undersampling, SMOTETomek and SMOTEENN. Features were selected using the ReliefF scoring method, and only features with a contribution value greater than 0.01 were used for model construction. Five machine learning models were considered: classification tree, logistic regression, naïve Bayes, artificial neural network and support vector machine. The models were trained on the original and the resampled training sets, and their performance was then evaluated on the test set. The results showed that SMOTENC was the most effective of the balancing techniques considered, giving better classification performance than both the other methods and the original dataset. Models trained with SMOTENC showed significant improvements in AUC (0.76 vs 0.67) and specificity (0.73 vs 0.43) over those trained on the original dataset, while maintaining comparable accuracy (0.72 vs 0.74). Furthermore, different sampling techniques led to different selections of the most predictive features. In conclusion, the study highlights the crucial role of dataset balancing methods in improving the classification performance of predictive models on highly imbalanced datasets, such as in the stratification of stroke patients with epileptiform EEG patterns. [ABSTRACT FROM AUTHOR]
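The oversampling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain SMOTE on purely numeric features, whereas SMOTENC additionally handles categorical features. The function name smote_oversample, the parameter k and the toy data are hypothetical. As in the abstract, only the training split is resampled and the test set is left untouched.

```python
import random
from collections import Counter


def smote_oversample(X, y, minority_label, k=3, seed=0):
    """Plain SMOTE sketch (numeric features only, NOT SMOTENC): add synthetic
    minority rows until the minority class matches the largest class."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    counts = Counter(y)
    n_new = max(counts.values()) - counts[minority_label]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        X_new.append([b + gap * (m - b) for b, m in zip(base, mate)])
        y_new.append(minority_label)
    return X_new, y_new


# Hypothetical toy training split: 6 majority rows (label 0), 2 minority rows (label 1).
X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
           [2.0, 2.0], [2.4, 1.8]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = smote_oversample(X_train, y_train, minority_label=1)
print(Counter(y_bal))
```

After resampling, both classes in the toy training split contain six rows, and every synthetic row lies on the segment between two original minority rows. In practice a maintained library implementation would be preferable to this sketch.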
Database: Supplemental Index