Author: |
Cataldo-Ramirez CC; Department of Anthropology, University of California Davis, Davis, CA, 95616, USA.; Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, CA 91001, USA., Lin M; Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA., Mcmahon A; Department of Anthropology, University of California Davis, Davis, CA, 95616, USA., Gignoux CR; Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA., Weaver TD; Department of Anthropology, University of California Davis, Davis, CA, 95616, USA., Henn BM; Department of Anthropology, University of California Davis, Davis, CA, 95616, USA.; UC Davis Genome Center, University of California Davis, Davis, CA, 95616, USA. |
Abstract: |
Genome-wide association studies (GWAS) and polygenic score (PGS) development are typically constrained by the data available in biobank repositories, in which European cohorts are vastly overrepresented. Here, we increase the utility of non-European participant data within the UK Biobank (UKB) by characterizing the genetic affinities of UKB participants who self-identify as Bangladeshi, Indian, Pakistani, "White and Asian" (WA), and "Any Other Asian" (AOA), with the aim of creating a larger South Asian sample for future genetic analyses. We assess the relationships between genetic structure and self-selected ethnic identities and find consistent patterns of clustering, which we use to train a support vector machine (SVM). The SVM model was used to reassign n = 1,853 AOA and WA participants at the subcontinental level, increasing the sample size of the UKB South Asian group by 1,381 participants. We then leverage these samples to assess GWAS performance and PGS development. We further include environmental covariates in the height GWAS by implementing a rigorous covariate selection procedure, and compare the outputs of two GWAS models: GWAS_null and GWAS_env. We show that a PGS derived from the environmentally adjusted GWAS yields prediction comparable to PGS models developed with a training dataset an order of magnitude larger (R² = 0.021 vs. 0.026). Models with 7-8 environmental covariates double the variance explained by the PGS alone. In summary, we demonstrate how GWAS performance can be improved by leveraging ambiguous ethnicity codes, ancestry-matched imputation panels, and environmental covariates. |
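The comparison of variance explained by a PGS alone versus a PGS plus environmental covariates can be illustrated with a small sketch. This is not the authors' pipeline: the data are simulated, the effect sizes are arbitrary, and the names `ols_r2`, `pgs`, `env`, and `height` are hypothetical. It only shows how R² is computed for two nested linear models.

```python
import random

def ols_r2(X, y):
    """Return R^2 of an ordinary least-squares fit of y on the columns of X.

    An intercept column is added automatically. Pure-Python normal
    equations; intended for small illustrative data only.
    """
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Build the augmented normal-equation system [A^T A | A^T y].
    M = [
        [sum(rows[k][i] * rows[k][j] for k in range(n)) for j in range(p)]
        + [sum(rows[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, p):
            f = M[r][c] / M[c][c]
            for j in range(c, p + 1):
                M[r][j] -= f * M[c][j]
    # Back substitution for the coefficient vector b.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (M[i][p] - sum(M[i][j] * b[j] for j in range(i + 1, p))) / M[i][i]
    pred = [sum(bi * xi for bi, xi in zip(b, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Simulated (not real) data: a standardized PGS, one environmental
# covariate, and a phenotype that depends weakly on both.
random.seed(0)
n = 500
pgs = [random.gauss(0, 1) for _ in range(n)]
env = [random.gauss(0, 1) for _ in range(n)]
height = [0.15 * g + 0.20 * e + random.gauss(0, 1) for g, e in zip(pgs, env)]

r2_pgs = ols_r2([[g] for g in pgs], height)                 # PGS only
r2_full = ols_r2([[g, e] for g, e in zip(pgs, env)], height)  # PGS + covariate

print(f"R^2, PGS alone:       {r2_pgs:.3f}")
print(f"R^2, PGS + covariate: {r2_full:.3f}")
```

Because the two models are nested, the second R² can never be lower than the first on the same data; how much it rises depends on how strongly the added covariates relate to the phenotype.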