Distinguishing among complex evolutionary models using unphased whole-genome data through Approximate Bayesian Computation
Author: | Silvia Ghirotto, Andrea Benazzo, Maria Teresa Vizzari, Guido Barbujani, Francesca Tassi |
---|---|
Language: | English |
Year of publication: | 2018 |
Subject: | 0106 biological sciences; 0303 health sciences; 010603 evolutionary biology; 01 natural sciences; 03 medical and health sciences; 030304 developmental biology; Computer science; Population; Sampling (statistics); Population genetics; Inference; Locus (genetics); Machine learning; Genome; Random forest; Artificial intelligence; Approximate Bayesian computation; Statistic |
DOI: | 10.1101/507897 |
Description: | Inferring past demographic histories is crucial in population genetics, and the number of complete genomes now available should in principle facilitate this inference. In practice, however, the available inferential methods suffer from severe limitations. Although hundreds of complete genomes can be analyzed simultaneously, complex demographic processes can easily exceed computational constraints, and the procedures for evaluating the reliability of the estimates further increase the computational effort. Here we present an Approximate Bayesian Computation (ABC) framework, based on the Random Forest algorithm, to infer complex past population processes using complete genomes. To do this, we propose to summarize the data by the full genomic distribution of the four mutually exclusive categories of segregating sites (FDSS), a statistic that is fast to compute from unphased genome data. We constructed an efficient ABC pipeline and tested how accurately it allows one to recognize the true model among models of increasing complexity, using simulated data and taking into account different sampling strategies in terms of the number of individuals analyzed and the number and size of the genetic loci considered. We tested the power of the FDSS to be informative about even complex evolutionary histories and compared the results with those obtained by summarizing the data through the unfolded Site Frequency Spectrum, thus highlighting for both statistics the experimental conditions that maximize inferential power. Finally, we analyzed two datasets, testing models of (a) the dispersal of anatomically modern humans out of Africa and (b) the evolutionary relationships among the three species of orangutan inhabiting Borneo and Sumatra. |
Database: | OpenAIRE |
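
The description names the FDSS as a summary built from four mutually exclusive categories of segregating sites, computed from unphased data. The record does not spell out the computation, so the following is only a minimal sketch under stated assumptions: two populations, biallelic sites, genotypes coded as alternate-allele counts (0, 1, 2) per diploid individual, and the four categories taken to be sites private to population 1, private to population 2, shared, and fixed for different alleles. The function names, the input layout, and the `max_sites` binning are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter

# Assumed category labels; the order is arbitrary but fixed.
CATEGORIES = ("private_1", "private_2", "shared", "fixed")


def classify_site(geno_1, geno_2):
    """Classify one biallelic site from unphased genotypes.

    geno_1, geno_2: alternate-allele counts (0, 1 or 2) for each diploid
    individual in population 1 and population 2.
    Returns a category label, or None if the site does not segregate.
    """
    alt_1, alt_2 = sum(geno_1), sum(geno_2)
    poly_1 = 0 < alt_1 < 2 * len(geno_1)  # polymorphic within population 1
    poly_2 = 0 < alt_2 < 2 * len(geno_2)  # polymorphic within population 2
    if poly_1 and poly_2:
        return "shared"
    if poly_1:
        return "private_1"
    if poly_2:
        return "private_2"
    # Both populations are monomorphic: a fixed difference only if they
    # are fixed for different alleles.
    if (alt_1 == 0) != (alt_2 == 0):
        return "fixed"
    return None


def locus_counts(locus):
    """Count sites per category within one locus.

    locus: list of (geno_1, geno_2) pairs, one pair per site.
    """
    counts = Counter(classify_site(g1, g2) for g1, g2 in locus)
    return tuple(counts[c] for c in CATEGORIES)


def fdss(loci, max_sites):
    """For each category, tabulate how many loci carry 0, 1, ..., max_sites
    sites of that category (larger counts are pooled into the last bin)."""
    hist = {c: [0] * (max_sites + 1) for c in CATEGORIES}
    for locus in loci:
        for category, n in zip(CATEGORIES, locus_counts(locus)):
            hist[category][min(n, max_sites)] += 1
    return hist
```

Because only per-population allele counts are used, no haplotype phase is needed, which is consistent with the description's point that the statistic is fast to compute from unphased genomes. In an ABC Random Forest setting, the concatenated histograms would serve as the feature vector for each simulated or observed dataset.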