MethylGPT: a foundation model for the DNA methylome.

Autor: Ying K; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.; T. H. Chan School of Public Health, Harvard University, Boston, MA, USA., Song J; Department of Physics, MIT, Cambridge, MA, USA., Cui H; Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada.; Department of Computer Science, University of Toronto, Toronto, ON, Canada.; Vector Institute, Toronto, ON, Canada., Zhang Y; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., Li S; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., Chen X; Department of Computer Science, University of Toronto, Toronto, ON, Canada.; Vector Institute, Toronto, ON, Canada., Liu H; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., Eames A; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., McCartney DL; Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, Scotland, United Kingdom., Marioni RE; Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, Scotland, United Kingdom., Poganik JR; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., Moqri M; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA., Wang B; Department of Computer Science, University of Toronto, Toronto, ON, Canada.; Vector Institute, Toronto, ON, Canada.; Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada., Gladyshev VN; Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
Jazyk: angličtina
Zdroj: BioRxiv : the preprint server for biology [bioRxiv] 2024 Nov 04. Date of Electronic Publication: 2024 Nov 04.
DOI: 10.1101/2024.10.30.621013
Abstrakt: DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 (154,063 after QC and deduplication) human methylation profiles spanning diverse tissue types from 5,281 datasets, curated 49,156 CpG sites, and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations of CpG sites, capturing both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R=0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared to existing methods. Analysis of the model's attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When finetuned to mortality and disease prediction across 60 major conditions using 18,859 samples from Generation Scotland, MethylGPT achieves robust predictive performance and enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation patterns while preserving biological interpretability, suggesting broad utility for epigenetic analysis and clinical applications.
Databáze: MEDLINE