Popis: |
Disease samples are naturally fewer than healthy samples which introduces bias in the training of machine learning (ML) models. Current study focuses in learning discriminating patterns between cesarean and non-cesarean phenomena based on a dataset consisting of 161 features of total 692 cesarean and 5465 non-cesarean samples which comes as four folds based on four different hospitals (hospital A, B, C and D). The dataset is noisy, contains missing values, features are at different scales and above all, 161 features are quite a large in number and risks containing unnecessary information with respect to learning to separate the C-section class from non-cesarean.This study introduced a data pre-processing pipeline, resolving issues with data imbalance, handling missing values, identifying and deleting outliers, etc. A novel ensemble model is proposed which is able to consistently perform better irrespective of data volumes (data fold A, A+B, A+B+C and A+B+C+D) and pre-processing pipeline and achieved 96-99% accuracy across data volumes. Finally, the proposed model’s decision-making was explained in terms of prominent features where higher values of features like Episiotomy, age of women and Fetal intrapartum pH accounts for causing C-section. |