A machine learning approach using conditional normalizing flow to address extreme class imbalance problems in personal health records

Authors:	Yeongmin Kim, Wongyung Choi, Woojeong Choi, Grace Ko, Seonggyun Han, Hwan-Cheol Kim, Dokyoon Kim, Dong-gi Lee, Dong Wook Shin, Younghee Lee
Language:	English
Publication year:	2024
Subject:	Personal health record; Class imbalance; Machine learning; Conditional normalizing flow; Computer applications to medicine. Medical informatics (R858-859.7); Analysis (QA299.6-433)
Source:	BioData Mining, Vol 17, Iss 1, Pp 1-18 (2024)
Document type:	article
ISSN:	1756-0381
DOI:	10.1186/s13040-024-00366-0
Description:	Abstract
Background: Supervised machine learning models have been widely used to predict diseases and gain insight into them by classifying patients based on personal health records. However, class imbalance is an obstacle that disrupts model training. In this study, we aimed to address class imbalance with a conditional normalizing flow model, a deep-learning-based semi-supervised model for anomaly detection. This is the first introduction of the normalizing flow algorithm to tabular biomedical data.
Methods: We collected personal health records from South Korean citizens (n = 706), comprising genetic data obtained from a direct-to-consumer service (microarray chip), medical health check-ups, and lifestyle log data. Six chronic diseases were labeled based on the health check-up data (obesity, diabetes, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension). After preprocessing, supervised classification models and semi-supervised anomaly detection models, including conditional normalizing flow, were evaluated by AUROC and AUPRC on the classification of diabetes, which had an extreme target imbalance (about 2%). In addition, we evaluated their performance on the other chronic diseases under the assumption that too few patients had been collected, simulated by undersampling disease-affected samples.
Results: Across fifty evaluations of diabetes classification, whose base rate was very low (0.02), LightGBM (the best-performing supervised classification model) achieved an AUPRC of 0.16 and an AUROC of 0.82, whereas conditional normalizing flow achieved an AUPRC of 0.34 and an AUROC of 0.83. Moreover, conditional normalizing flow performed better than the supervised model when few disease-affected samples were available for the other five chronic diseases: obesity, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension. For example, when predicting obesity with undersampling of disease-affected samples (positive undersampling) lowering the base rate to 0.02, LightGBM achieved an AUPRC of 0.20 and an AUROC of 0.75, while conditional normalizing flow achieved an AUPRC of 0.30 and an AUROC of 0.74.
Conclusions: Our research suggests that conditional normalizing flow is useful for predicting chronic diseases from personal health records, particularly when the available cases are limited. This approach offers an effective way to deal with the sparse data and extreme class imbalance commonly encountered in the biomedical context.
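The positive undersampling and the two evaluation metrics named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`undersample_positives`, `auroc`, `average_precision`), the random sampling procedure, the synthetic label counts, and the use of average precision as the AUPRC estimator are not taken from the article, which does not state these details in the abstract.

```python
import random


def undersample_positives(labels, target_rate, seed=0):
    """Keep all negatives and a random subset of positives so that
    positives make up roughly `target_rate` of the kept samples.
    (Hypothetical helper; the article's exact procedure is not given.)"""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve k / (k + n_neg) = target_rate for the number of positives k.
    k = round(target_rate * len(neg) / (1.0 - target_rate))
    k = max(1, min(k, len(pos)))
    return sorted(neg + rng.sample(pos, k))


def auroc(y_true, scores):
    """Probability that a random positive scores above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision, one common estimator of AUPRC: the mean of
    the precision at the rank of each positive. Assumes no tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic labels (not the study data): 500 unaffected, 200 affected.
labels = [0] * 500 + [1] * 200
kept = undersample_positives(labels, target_rate=0.02)
kept_labels = [labels[i] for i in kept]
base_rate = sum(kept_labels) / len(kept_labels)

# Tiny hand-checkable scoring example.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A classifier that ranks samples at random has an expected AUPRC close to the base rate but an expected AUROC of 0.5, so at a base rate of 0.02 the AUPRC values reported above are best read against a reference of about 0.02 rather than 0.5.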
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/3e6653a7d4dd41d1976fae83356d3848