Preeclampsia Predictor with Machine Learning: A Comprehensive and Bias-Free Machine Learning Pipeline

Authors: Yun C. Lin, Daniel Mallia, Andrea O. Clark-Sevilla, Adam Catto, Alisa Leshchenko, David M. Haas, Ronald Wapner, Itsik Pe’er, Anita Raja, Ansaf Salleb-Aouissi
Year of publication: 2022
DOI: 10.1101/2022.06.08.22276107
Description: Preeclampsia is a type of hypertension that develops during pregnancy. It is one of the leading causes of maternal morbidity, with consequences during and after pregnancy. Because of its diverse clinical presentation, preeclampsia is a uniquely challenging adverse pregnancy outcome to predict and manage. In this paper, we explore preeclampsia in a nulliparous study cohort, using machine learning techniques to build a model that distinguishes between the participants most at risk for morbidity (those with preeclampsia with severe features or eclampsia) and the class with no pregnancy-related hypertension. We curated the dataset for this secondary analysis using only training examples that have all known biomarkers, factors, and placental analytes. We built classification models at discrete time points in pregnancy that combine risk factors for preeclampsia with severe features or eclampsia to help screen cases early in pregnancy. The time points are 6+0 to 13+6 (V1), 16+0 to 21+6 (V2), and 22+0 to 29+6 (V3) weeks of gestation (weeks+days), and delivery (V4). We then analyzed the model prediction results and provided an interpretable report of cut-off points for the top contributing risk factors and their impact on prediction. Finally, we identified race-based biases in our models and describe how we mitigated them. We evaluated the results of four machine learning algorithms and found that ensemble methods outperformed non-ensemble methods. Random Forest models achieved an area under the receiver operating characteristic curve of 0.68 ± 0.05 at V1, 0.73 ± 0.05 at V2, 0.76 ± 0.04 at V3, and 0.83 ± 0.03 at V4. In the Random Forest models, the features found to be most informative across all visits fall into several broad categories: weight, blood pressure measurements, uterine artery Doppler measurements, dietary intake, and serum biomarkers. We found that our models are biased toward non-Hispanic Black participants, with a high predictive equality ratio of 1.31. We corrected this bias and reduced the ratio to 1.14. We also evaluated predictions of early versus late cases of preeclampsia with severe features or eclampsia and found placental analytes to be the top contributors to model feature importance. Random Forest models for this analysis achieved an area under the receiver operating characteristic curve of 0.63 ± 0.11 at V1, 0.79 ± 0.11 at V2, 0.83 ± 0.08 at V3, and 0.84 ± 0.09 at V4. Our experiments suggest that it is both important and possible to create screening models that predict which participants are at risk of developing preeclampsia with severe features or eclampsia. The top features stress the importance of using several tests, in particular tests for biomarkers and ultrasound measurements. The models could be used as a screening tool as early as 6 to 13 weeks of gestation to help clinicians identify participants who may subsequently develop preeclampsia, either confirming the cases they suspect or identifying unsuspected ones. The proposed approach is easily adaptable to address any adverse pregnancy outcome with fairness.
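The abstract reports a predictive equality ratio without defining it. As a minimal sketch, assuming the ratio is the false positive rate of one group divided by that of a reference group (the usual reading of predictive equality, but not confirmed by this record), it could be computed as below. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed over the truly negative examples."""
    negatives = [pred for true, pred in zip(y_true, y_pred) if true == 0]
    if not negatives:
        raise ValueError("no negative examples; FPR is undefined")
    return sum(negatives) / len(negatives)


def predictive_equality_ratio(y_true, y_pred, groups, group, reference):
    """Ratio of the FPR in `group` to the FPR in `reference` (assumed definition)."""
    def fpr_for(name):
        idx = [i for i, g in enumerate(groups) if g == name]
        return false_positive_rate([y_true[i] for i in idx],
                                   [y_pred[i] for i in idx])

    reference_fpr = fpr_for(reference)
    if reference_fpr == 0:
        raise ValueError("reference group FPR is zero; ratio is undefined")
    return fpr_for(group) / reference_fpr


# Toy data (illustrative only): 1 = predicted or actual case, 0 = no case.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

ratio = predictive_equality_ratio(y_true, y_pred, groups, group="A", reference="B")
print(ratio)
```

Under this assumed definition, a ratio of 1.0 would mean both groups receive false alarms at the same rate, so a drop from 1.31 to 1.14 would indicate a smaller gap in false positive rates between groups.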
Database: OpenAIRE