Improving biological understanding and complex trait prediction by integrating prior information in genomic feature models

Author: Edwards, Stefan McKinnon
Language: English
Year of publication: 2014
Source: Edwards, S M 2014, Improving biological understanding and complex trait prediction by integrating prior information in genomic feature models. Aarhus University, Faculty of Science and Technology.
Description: In this thesis we investigate an approach for integrating external data into the analysis of genetic variants. The goal is similar to that of gene-set enrichment tests, but the approach relies on the robust statistical framework of linear mixed models. It has allowed us to integrate virtually any externally derived information, such as KEGG pathways, Gene Ontology gene sets, or genomic features, and to estimate the joint contribution of the genetic variants within these sets to complex trait phenotypes. The analysis of complex trait phenotypes is hampered by the myriad of genes that control the trait, as these genes have small to moderate effects that can be difficult to detect. However, by looking at sets of genes, as in gene-set enrichment tests, it may become easier to assess an association between the set of genes and the complex trait. --- The linear mixed models applied here are the same as those used for predicting breeding values of future progeny in animal breeding, which is essentially a 'black box' modelling approach in which all genetic variants are represented both equally and independently. To open the black box, we partition the genetic variants according to external knowledge, allowing the genetic variants to be represented as different sets. We interpret this as separating the genomic signal from the noise. This is, however, not without issues: in some populations, such as Danish Holstein dairy cattle, the genetic variants can be highly correlated across multiple genes. Selecting one of these genes, but not the neighbouring and highly correlated gene, makes it difficult to divide signal and noise between two highly correlated sets. --- We are therefore faced with the challenge of evaluating whether a model fitted to a given set of genetic variants is significant. We look into several test statistics, such as model fit, explained genetic variance, and predictive ability. Using the latter to rank sets of genetic variants is, to our knowledge, a novel approach. --- To evaluate the sets, we rely on empirically derived distributions. The empirical distributions are generated to answer several questions: the permutations, which correspond to a self-contained test, may answer whether the association found between set and trait is spurious, and the random gene samples, which correspond to a competitive test, quantify the expected association for a set of similar size. In addition, we use cross-validation to estimate the predictive ability of a set. --- The thesis consists of 3 chapters introducing the methodology of linear mixed models, their application, and their evaluation by means of test statistics, followed by 3 manuscripts in which we investigate the use of our approach in Danish Holstein dairy cattle and the model organism Drosophila melanogaster (fruit fly). In these manuscripts, we use health traits and production traits (the latter only in cattle), which are all complex traits. The union of such dissimilar subjects as dairy cattle and fruit flies enables us to investigate how the population structures of the two species may affect the results, as the dairy cattle population has many highly related individuals, while the fruit fly population consists of several entirely inbred sub-populations with low relatedness between sub-populations. --- This thesis demonstrates the successful application of an integrative approach for enhancing the systems genetics analysis of complex traits. The results show that by using informed subsets of genetic variants, it is possible to increase the predictive ability in populations of low relatedness, a valuable prospect for fields such as personalised medicine.
Database: OpenAIRE