CluStrat: a structure informed clustering strategy for population stratification
Authors: | Petros Drineas, Aritra Bose, Agniva Chowdhury, Myson C. Burch, Peristera Paschou |
---|---|
Year of publication: | 2020 |
Subject: |
0303 health sciences
Mahalanobis distance; Linkage disequilibrium; Covariance matrix; Computer science; Correlation and dependence; Genome-wide association study; Population stratification; 03 medical and health sciences; 0302 clinical medicine; Principal component analysis; Statistics; Leverage (statistics); Spurious relationship; Cluster analysis; 030217 neurology & neurosurgery; 030304 developmental biology; Genetic association |
DOI: | 10.1101/2020.01.15.908228 |
Description: | Genome-wide association studies (GWAS) have been extensively used to estimate the signed effects of trait-associated alleles. Recent independent studies failed to replicate the strong evidence of selection for height across Europe, implying shortcomings in standard population stratification correction approaches. Here, we present CluStrat, a stratification correction algorithm for complex population structure that leverages the linkage disequilibrium (LD)-induced distances between individuals. CluStrat performs agglomerative hierarchical clustering using the Mahalanobis distance and then applies sketching-based randomized ridge regression on the genotype data to obtain the association statistics. With the growing size of data, computing and storing the genome-wide covariance matrix is a non-trivial task. We avoid this overhead by computing the genetic relationship matrix (GRM) directly, using a connection between statistical leverage scores and the Mahalanobis distance. In a large simulation study of discrete and admixed, arbitrarily structured sub-populations, CluStrat identifies two- to three-fold more true causal variants than principal component (PC)-based stratification correction methods, at the cost of slightly more spurious associations. Applying CluStrat to WTCCC2 Parkinson’s disease (PD) data, we identified loci mapping to genes associated with PD, including BACH2, MAP2, NR4A2, SLC11A1, and UNC5C. Availability and Implementation: CluStrat source code and user manual are available at: https://github.com/aritra90/CluStrat |
Database: | OpenAIRE |
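
The description mentions a connection between statistical leverage scores and the Mahalanobis distance that avoids forming the genome-wide covariance matrix. The sketch below shows one way such a connection can be written down; it is an illustration under stated assumptions, not the CluStrat implementation. It assumes a column-centered genotype matrix X (individuals by markers) with thin SVD X = U S V^T and uses the pseudo-inverse of the sample covariance, in which case the squared distance between individuals i and j equals (n - 1)(H_ii + H_jj - 2 H_ij), where H = U U^T and the diagonal of H holds the leverage scores. The function name `mahalanobis_from_leverage` and its parameters `rank` and `tol` are hypothetical.

```python
import numpy as np

def mahalanobis_from_leverage(X, rank=None, tol=1e-10):
    """Squared pairwise Mahalanobis distances between the rows of X.

    Works from the left singular vectors of the centered data (n x n work),
    so the markers-by-markers covariance matrix is never formed.
    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)                       # center each marker (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s[0]                         # drop numerically zero directions
    if rank is not None:
        keep[rank:] = False                       # optional rank-k truncation
    U = U[:, keep]
    H = U @ U.T                                   # projection ("hat") matrix
    lev = np.diag(H)                              # statistical leverage scores
    n = X.shape[0]
    D2 = (n - 1) * (lev[:, None] + lev[None, :] - 2.0 * H)
    return np.maximum(D2, 0.0), lev               # clip tiny negative round-off

# Usage on toy genotypes coded 0/1/2 (assumed data, not from the study)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 6)).astype(float)
D2, lev = mahalanobis_from_leverage(X)
```

The resulting distance matrix could then be passed to any agglomerative hierarchical clustering routine; the choice of linkage and the number of clusters are left open here because the record does not specify them.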