Abstract: |
f-statistics have emerged as a first line of analysis for making inferences about demographic history from genome-wide data. Not only are they guaranteed to allow robust tests of the fits of proposed models of population history to data when analyzing full genome sequencing data, that is, all single nucleotide polymorphisms (SNPs) in the individuals being analyzed, but they are also guaranteed to allow robust tests of models for SNPs ascertained as polymorphic in a population that is an outgroup in a phylogenetic sense to all groups being analyzed. True "outgroup ascertainment" is in practice impossible in humans because our species has arisen from a substructured ancestral population that does not descend from a homogeneous ancestral population going back many hundreds of thousands of years into the past. However, initial studies suggested that non-outgroup-ascertainment schemes might produce sufficiently robust results using f-statistics, and this motivated widespread fitting of models to data using non-outgroup-ascertained SNP panels such as the "Affymetrix Human Origins array", which has been genotyped on thousands of modern individuals from hundreds of populations, or the "1240k" in-solution enrichment reagent, which has been the source of about 70% of published genome-wide data for ancient humans. In this study, we show that while analyses of population history using such panels work well for studies of relationships among non-African populations and one African outgroup, when co-modeling more than one sub-Saharan African and/or archaic human group (Neanderthals and Denisovans), fitting of f-statistics to such SNP sets is expected to frequently lead to false rejection of true demographic histories and to failure to reject incorrect models. Analyzing panels of SNPs polymorphic in archaic humans, which has been suggested as a solution for the ascertainment problem, has limited statistical power and retains important biases.
However, by carrying out simulations of diverse demographic histories, we show that bias in inferences based on f-statistics can be minimized by ascertaining on variants common in a union of diverse African groups; such ascertainment retains high statistical power while allowing co-analysis of archaic and modern groups.
Author summary: Archaeogenetic research on humans remains heavily biased towards Europe and Central and East Asia because ancient DNA preserves poorly in hot climates. However, the number of studies focused on the history of African human populations is growing. Because of these DNA preservation problems, targeted enrichment for selected variable loci is almost unavoidable in archaeogenetic research focused on Africans. Moreover, the poor quality of archaeogenetic data limits the analytical toolkit: it is often restricted to methods based on f-statistics, PCA, and ADMIXTURE. It is known that f-statistics may be biased when they are calculated not on whole-genome data but on sets of SNPs selected in a non-random way. Although this is common knowledge, the biases affecting f-statistics on such SNP sets ("ascertainment biases") remain poorly explored in practice, and our study is designed to fill this gap. We investigate biases affecting individual f4-statistics and fits of admixture graph models on simulated and real data, explore dozens of ascertainment schemes, and provide a set of guidelines for minimizing bias. We show that ascertainment bias is particularly strong when several African populations are co-analyzed with non-African and archaic (Neanderthal or Denisovan) human groups. [ABSTRACT FROM AUTHOR] |