Accelerating Genome- and Phenome-Wide Association Studies using GPUs - A case study using data from the Million Veteran Program.

Authors: Rodriguez A; Data Science and Learning, Argonne National Laboratory, Lemont, IL, 60439, USA., Kim Y; Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, 60439, USA., Nandi TN; Data Science and Learning, Argonne National Laboratory, Lemont, IL, 60439, USA., Keat K; Institute for Biomedical Informatics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Kumar R; Institute for Biomedical Informatics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Bhukar R; Program in Medical and Population Genetics, Cambridge, MA, 02142, USA.; Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, 02114, USA., Conery M; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Liu M; Department of Biostatistics, Columbia University's Mailman School of Public Health, New York, NY, 10032, USA., Hessington J; Information systems, University of Pennsylvania, Philadelphia, PA, 19104, USA., Maheshwari K; Oak Ridge National Laboratory, Oak Ridge, TN, USA., Schmidt D; Oak Ridge National Laboratory, Oak Ridge, TN, USA., Begoli E; Oak Ridge National Laboratory, Oak Ridge, TN, USA., Tourassi G; Computing and Computational Sciences Directorate, Oak Ridge National Laboratory, Oak Ridge, TN, 37830, USA., Muralidhar S; Office of Research and Development, Department of Veterans Affairs, Washington, DC, 20420, USA., Natarajan P; Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, 02114, USA.; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.; Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA.; Cardiology Division, Massachusetts General Hospital, Boston, MA, 02114, USA., Voight BF; Corporal Michael Crescenz VA Medical Center, Philadelphia, PA, 19104, USA.; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA.; Department of Genetics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA.; Institute of Translational Medicine and Therapeutics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Cho K; MVP Boston Coordinating Center, VA Boston Healthcare System, Boston, MA, 02111, USA.; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.; Department of Medicine, Division of Aging, Brigham and Women's Hospital, Boston, MA, 02115, USA., Gaziano JM; MVP Boston Coordinating Center, VA Boston Healthcare System, Boston, MA, 02111, USA.; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.; Department of Medicine, Division of Aging, Brigham and Women's Hospital, Boston, MA, 02115, USA., Damrauer SM; Corporal Michael Crescenz VA Medical Center, Philadelphia, PA, 19104, USA.; Department of Genetics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA.; Department of Surgery, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA.; Cardiovascular Institute, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Liao KP; Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, 02130, USA.; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.; Medicine, Rheumatology, VA Boston Healthcare System, Boston, MA, 02130, USA.; Department of Medicine, Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, Boston, MA, 02115, USA., Zhou W; Program in Medical and Population Genetics, Cambridge, MA, 02142, USA.; Department of Medicine, Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, 02114, USA.; Stanley Center for Psychiatric Research, Cambridge, MA, 02142, USA., Huffman JE; Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, 02130, USA.; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.; Palo Alto Veterans Institute for Research (PAVIR), Palo Alto Health Care System, Palo Alto, CA, 94304, USA., Verma A; Corporal Michael Crescenz VA Medical Center, Philadelphia, PA, 19104, USA.; Institute for Biomedical Informatics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA.; Department of Medicine, Division of Translational Medicine and Human Genetics, University of Pennsylvania - Perelman School of Medicine, Philadelphia, PA, 19104, USA., Madduri RK; Data Science and Learning, Argonne National Laboratory, Lemont, IL, 60439, USA.
Language: English
Source: BioRxiv : the preprint server for biology [bioRxiv] 2024 May 22. Date of Electronic Publication: 2024 May 22.
DOI: 10.1101/2024.05.17.594583
Abstract: The expansion of biobanks has significantly propelled genomic discoveries, yet the sheer scale of data within these repositories poses formidable computational hurdles, particularly in handling the extensive matrix operations required by prevailing statistical frameworks. In this work, we introduce computational optimizations to the SAIGE (Scalable and Accurate Implementation of Generalized Mixed Model) algorithm, notably employing a GPU-based distributed computing approach to tackle these challenges. We applied these optimizations to conduct a large-scale genome-wide association study (GWAS) across 2,068 phenotypes derived from electronic health records of 635,969 diverse participants from the Veterans Affairs (VA) Million Veteran Program (MVP). Our strategies enabled scaling up the analysis to over 6,000 nodes on the Department of Energy (DOE) Oak Ridge Leadership Computing Facility (OLCF) Summit High-Performance Computer (HPC), resulting in a 20-fold acceleration compared to the baseline model. We also provide a Docker container with our optimizations, which was successfully used on multiple cloud infrastructures with the UK Biobank and All of Us datasets, where we showed significant time and cost benefits over the baseline SAIGE model.
Database: MEDLINE