Latent Forests to Model Genetical Data for the Purpose of Multilocus Genome-Wide Association Studies. Which Clustering Should Be Chosen?
Author: | Duc-Thanh Phan, Philippe Leray, Christine Sinoquet |
---|---|
Contributors: | Laboratoire d'Informatique de Nantes Atlantique (LINA), Centre National de la Recherche Scientifique (CNRS)-Mines Nantes (Mines Nantes)-Université de Nantes (UN), ANR-13-MONU-0013, SAMOGWAS, Modèles graphiques avancés pour les études d'association à l'échelle du génome (2013) |
Year of publication: | 2015 |
Subject: |
probabilistic graphical model; genome-wide association study; multilocus association study; Computer science; Bayesian network; Context (language use); Genome-wide association study; Latent variable; data dimension reduction; computer.software_genre; Set (abstract data type); latent variable; Data mining; Graphical model; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Cluster analysis; computer; linkage disequilibrium; Genetic association |
Source: | Biomedical Engineering Systems and Technologies, ISBN: 9783319277066; BIOSTEC (Selected Papers); Communications in Computer and Information Science, Springer, pp.17, 2015, BIOSTEC2015 |
DOI: | 10.1007/978-3-319-27707-3_11 |
Description: | The aim of genetic association studies, and in particular genome-wide association studies (GWASs), is to unravel the genetics of complex diseases. In this domain, machine learning offers an attractive alternative to classical statistical approaches. The seminal work of Mourad et al. (2011) led to the proposal of a novel class of probabilistic graphical models, the forest of latent trees (FLTM). The design of this model was motivated by the need to model genetical data at the genome scale, prior to a multilocus GWAS. A multilocus GWAS fully exploits information about the complex dependences existing within genetical data, to help detect the loci associated with the studied pathology. The FLTM framework also allows data dimension reduction. The FLTM model is a hierarchical Bayesian network with latent variables. Central to the FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. This article analyzes the impact of the choice of the clustering method used in the FLTM learning algorithm, in a GWAS context. We rely on a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 cases affected by Crohn's disease, and compare the impact of three clustering methods. We compare statistics about data dimension reduction as well as trends concerning the ability to split or group putative causal SNPs in agreement with the underlying biological reality. To assess the risk of missing significant association results due to subsumption, we also compare the clustering methods through the corresponding FLTM-based GWASs. In the GWAS context and in this framework, the choice of the clustering method does not influence the satisfactory performance of the GWAS. |
Database: | OpenAIRE |
External link: |