Abstract 11946: Comparison of Unsupervised Learning Approaches Applied to Electronic Health Record Traits in Heart Failure

Authors: Reza, Nosheen; Bone, William P; Singhal, Pankhuri; Yang, Yifan; Verma, Anurag; Murthy, Ashwin C; Denduluri, Srinivas; Adusumalli, Srinath; Ritchie, Marylyn; Cappola, Thomas P
Source: Circulation (Ovid); November 2021, Vol. 144 Issue: Supplement 1 pA11946-A11946, 1p
Abstract: Introduction: Unsupervised machine learning (UML) applied to high-dimensional data has been used to discover cardiovascular disease subtypes; however, the reproducibility of subtypes identified by different algorithms has not been explored. We compared the ability of several promising UML and clustering algorithms to identify heart failure (HF) subtypes using high-dimensional electronic health record (EHR) data. Methods: Using the Penn Medicine EHR, we identified all patients who had >2 instances of an ICD-10-CM HF diagnosis. We extracted 1272 EHR-based features (vital signs, demographics, echocardiographic measurements, laboratories, comorbidities) from the time of HF diagnosis and limited the cohort based on data completeness (n=8569). We selected the following methods based on prior success in simulation studies and used them to identify HF subtypes: Similarity Network Fusion (SNF), Locally Linear Embedding (LLE), Modified LLE, Uniform Manifold Approximation and Projection (UMAP), and Principal Component Analysis (PCA), followed by several clustering algorithms including K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Spectral Clustering. Cluster numbers (K) from 2 to 12 were evaluated. Clustering performance was assessed by silhouette score and visual separation. Results: Model visualizations are shown in the Figure. The highest silhouette score achieved for each model varied widely, from 0.02 to 0.62; the optimal cluster number ranged from 2 to 4 across models. Normalization and standardization of continuous data did not significantly alter silhouette scores or optimal cluster number. Conclusions: HF subtypes identified through UML applied to EHR data may vary substantially depending on the algorithms used. Benchmarking strategies to evaluate the reproducibility of UML in the EHR are needed to ensure valid HF patient stratification and phenotypic refinement.
Database: Supplemental Index
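
The evaluation loop described in the Methods (an embedding step, a clustering step, and a silhouette score for each K from 2 to 12) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn, uses only the PCA plus K-means combination, runs on synthetic data rather than EHR features, and scores the silhouette on the PCA embedding, which the abstract does not specify. The function name `best_k_by_silhouette` and all parameter values are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_k_by_silhouette(X, k_values=range(2, 13), n_components=2, seed=0):
    """Return (best_k, scores) for K-means run on a PCA embedding of X.

    Assumption: features are standardized first and the silhouette is
    computed on the embedding; the abstract does not state either choice.
    """
    X_scaled = StandardScaler().fit_transform(X)
    embedding = PCA(n_components=n_components, random_state=seed).fit_transform(X_scaled)

    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
        # Mean silhouette over all samples; values lie in [-1, 1].
        scores[k] = silhouette_score(embedding, labels)

    # The "optimal" K here is simply the one with the highest silhouette.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the EHR feature matrix (illustrative only).
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)
best_k, scores = best_k_by_silhouette(X)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

Swapping the `PCA` and `KMeans` steps for other embeddings or clusterers and comparing the resulting `best_k` values is one way to probe the algorithm dependence the Conclusions describe.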