Application of Data Mining Techniques in Human Population Genetic Structure Analysis

Author: Weng, Zhouyang
Language: English
Year of publication: 2017
Subject:
Document type: Text
Description: The success of a genome-wide association study (GWAS) depends on genotyping a large number of SNPs and determining which of them are significantly associated with disease outcome. When searching for these associations, it is important to account for effects caused by differences between ethnicities and population groups. The study of human population genetic structure focuses on analyzing genetic variation between populations and on assigning individuals to subpopulations based on the degree of that variation. Currently, the leading statistical method for uncovering population structure in GWAS is Principal Component Analysis (PCA). One major problem with applying PCA to SNP data, however, is that the resulting principal components do not correspond to actual SNP variables, so a way is needed to map the principal components back to individual SNPs and measure their importance in terms of ancestry information. To overcome this limitation, Sparse Principal Component Analysis (SPCA) has been proposed to identify a small set of structure-informative markers more efficiently. It modifies the alternating regression formulation of PCA by including a penalty term during optimization that encourages the loadings of SNPs with negligible contributions to vanish. Yet the computational cost of selecting a small subset of ancestry-informative SNP variables via SPCA can still be high, especially when a large number of non-zero loadings across multiple principal components are required for structure analysis. Given these limitations, it is desirable to find methods that not only achieve population classification but also reduce the number of explicitly used variables and select actual SNP variables that are ancestry-informative markers in a cost-effective manner.
The goals of this study are not only to draw inferences about the application of major data mining methods in human population genetic structure analysis but also to introduce a two-stage approach that combines two popular methods to improve efficiency and accuracy in population classification and variable selection. Specifically, the first step of the proposed two-stage method uses SPCA to identify a subset of SNP markers that capture the major genetic variation between population groups; the second step uses Random Forest (RF) to estimate population structure from the selected SNP markers and to perform variable selection of ancestry-informative markers. The two-step SPCA-RF approach was tested on empirical and simulated datasets. The empirical dataset was drawn from the simulated next-generation sequencing data provided for Genetic Analysis Workshop (GAW) 17, which are based on real exome sequence data from the 1000 Genomes Project. Results from the two-step SPCA-RF algorithm suggest that higher population prediction accuracy is possible with relatively fewer markers. Compared with existing methods, the proposed SPCA-RF approach consistently gave similar or lower error rates and retained all important ancestry-informative variables. Moreover, all methods were implemented in the open-source R software, which provides future researchers with the source code to replicate the research for further investigation.
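The sparsity penalty in the first (SPCA) stage can be illustrated with a minimal sketch. It assumes one common formulation, a rank-one alternating update with soft-thresholding of the loadings, which is not necessarily the exact algorithm used in the thesis (whose implementation was in R). The function names, the toy genotype matrix, and the penalty value `lam=0.5` are all hypothetical and chosen only for illustration. The second (Random Forest) stage is not shown.

```python
import math

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; any |z| <= lam becomes exactly zero."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sparse_pc1(genotypes, lam, n_iter=50):
    """First sparse principal component of an individuals-by-SNPs matrix.

    Alternates between individual scores u and SNP loadings v; the
    soft-threshold step is the penalty that drives small loadings to zero.
    """
    n, p = len(genotypes), len(genotypes[0])
    # Center each SNP column so loadings reflect variation, not allele frequency.
    means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in genotypes]
    v = normalize([1.0] * p)  # deterministic starting loadings
    for _ in range(n_iter):
        u = normalize([sum(X[i][j] * v[j] for j in range(p)) for i in range(n)])
        z = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = normalize([soft_threshold(zj, lam) for zj in z])
    return v

# Hypothetical toy data: minor-allele counts (0/1/2) for 6 individuals x 4 SNPs.
# Rows 0-2 and rows 3-5 stand for two populations; SNPs 0 and 1 differ
# between the groups, while SNPs 2 and 3 vary without regard to group.
genotypes = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
    [2, 1, 1, 1],
]

loadings = sparse_pc1(genotypes, lam=0.5)
selected = [j for j, w in enumerate(loadings) if w != 0.0]
print("loadings:", [round(w, 3) for w in loadings])
print("selected SNP indices:", selected)
```

On this toy matrix the penalty removes the two uninformative SNPs and keeps the two that separate the groups. In a two-stage setting, only the SNPs with non-zero loadings would be passed on to the classifier.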
Database: Networked Digital Library of Theses & Dissertations