Leveraging deep-learning on raw spirograms to improve genetic understanding and risk scoring of COPD despite noisy labels

Authors: Justin Cosentino, Babak Behsaz, Babak Alipanahi, Zachary R. McCaw, Davin Hill, Tae-Hwi Schwantes-An, Dongbing Lai, Andrew Carroll, Brian D. Hobbs, Michael H. Cho, Cory Y. McLean, Farhad Hormozdiari
Year of publication: 2022
DOI: 10.1101/2022.09.12.22279863
Description: Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify new genetic signals. Here we train a deep convolutional neural network on noisy self-reported and ICD-based labels to predict COPD case/control status from high-dimensional raw spirograms and use the model predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls (AUROC = 0.82 ± 0.01) and COPD-related hospitalization (AUROC = 0.89 ± 0.01) without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival (hazard ratio = 1.22 ± 0.01; P ≤ 2 × 10⁻¹⁶) and exacerbation events (R² = 0.10 ± 0.01; P ≤ 4 × 10⁻¹⁰¹). A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Thirty-eight of these have supportive evidence in independent datasets, including a locus near LTBR. We demonstrate the biological plausibility of the novel variants through enrichment analyses, phenome-wide association studies, and generalizability of COPD prediction in multiple datasets. These results provide an example of the potential to improve genetic discovery of disease-relevant variants by training deep neural networks to predict noisy labels from high-dimensional raw data.
Database: OpenAIRE
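
Note on the AUROC figures in the description: AUROC can be read as the probability that a randomly chosen case receives a higher liability score than a randomly chosen control. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name `auroc` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
def auroc(scores, labels):
    """Return the area under the ROC curve for a continuous score.

    Computed as the fraction of (case, control) pairs in which the case
    has the higher score, with ties counted as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: model scores in [0, 1] and binary case/control labels.
toy_scores = [0.9, 0.8, 0.35, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"AUROC on toy data: {toy_auroc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small illustration. For large cohorts a rank-based implementation from an established library would be the more practical choice.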