Ancestry Inference Using Reference Labeled Clusters of Haplotypes
Authors: | Jake K. Byrnes, Alisa Sedghifar, Catherine A. Ball, Eurie L. Hong, Joshua G. Schraiber, Shiya Song, Keith Noto, David A. Turissini, Yong Wang |
---|---|
Year of publication: | 2020 |
Subject: |
QH301-705.5
Computer science; Computer applications to medicine. Medical informatics; Population; R858-859.7; Inference; Polymorphism, Single Nucleotide; RFMix; Biochemistry; Structural Biology; Humans; Biology (General); HMM; 1000 Genomes Project; Molecular Biology; Genome, Human; Ancestry inference; Methodology Article; Applied Mathematics; Haplotype; Local ancestry; ARCHes; Computer Science Applications; Running time; Genetics, Population; Haplotypes; Evolutionary biology; Human genome; Haplotype modeling |
Source: | BMC Bioinformatics, Vol 22, Iss 1, Pp 1-14 (2021) |
Description: | We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual’s ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands of individuals, then annotating those models with population information from reference panels of known ancestry. The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it takes less than one minute on average to assign ancestry from 32 populations to 1,001 sections of a genotype using 10 CPUs). We test ARCHes on public data from the 1000 Genomes Project and HGDP as well as on simulated examples of known admixture. Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales, regardless of the amount of population admixture. Author Summary: Human DNA is inherited from ancestors who come from different populations across the globe and across time. Identifying which of those populations make up an individual’s DNA, how much each contributes, and on which chromosomes, is an important open research problem with many applications in the study of human diversity and history. As DNA sequencing and genotyping technology has developed, ever larger amounts of data have become available, which allows for the development of new, sophisticated machine learning methods for this problem and creates a need to process large amounts of data efficiently. These methods learn from examples of DNA data from known populations and must be robust to differences in size and diversity among those reference populations. 
We present a new approach to this problem called ARCHes (Ancestry inference using Reference labeled Clusters of Haplotypes), which models the global diversity of small segments of human DNA sequence (“haplotypes”) and the extent to which these haplotypes are associated with each of a set of population reference panels. It then computes the most likely population assignments and the points along the genome where the populations change. Our experiments show that ARCHes has superior accuracy compared to a state-of-the-art method in identifying source populations and their locations on the genome, regardless of the number of different populations present in the genome or how closely related those populations are. ARCHes is also able to model populations for which only a small amount of population reference DNA data is available. |
Database: | OpenAIRE |
External link: |
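
The description says that ARCHes computes the most likely population assignments and the points along the genome where the populations change, and the subject terms mention an HMM. The sketch below illustrates that general idea with a plain Viterbi decoder over genomic windows. It is not the ARCHes implementation: the function names, the `switch_prob` parameter, the uniform switching model, and the toy log-likelihoods are all assumptions made for illustration.

```python
import math


def viterbi_local_ancestry(window_loglik, switch_prob=0.01):
    """Return the most likely population label for each genomic window.

    window_loglik: list of dicts, one per window, mapping a population
    name to the log-likelihood of that window's haplotype under that
    population (hypothetical inputs; a real method would derive them
    from population-annotated haplotype models).
    switch_prob: assumed probability of changing population between
    adjacent windows, spread evenly over the other populations.
    """
    pops = list(window_loglik[0].keys())
    k = len(pops)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1)) if k > 1 else float("-inf")

    # best[p]: score of the best path ending in population p (uniform prior).
    best = {p: window_loglik[0][p] - math.log(k) for p in pops}
    back = []  # back[i][p]: best predecessor of p at window i + 1
    for ll in window_loglik[1:]:
        new_best, pointers = {}, {}
        for p in pops:
            prev, score = max(
                ((q, best[q] + (log_stay if q == p else log_switch)) for q in pops),
                key=lambda t: t[1],
            )
            new_best[p] = score + ll[p]
            pointers[p] = prev
        back.append(pointers)
        best = new_best

    # Trace back from the best final state.
    path = [max(pops, key=lambda p: best[p])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def change_points(path):
    """Indices of windows where the assigned population changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


def global_proportions(path):
    """Fraction of windows assigned to each population (global ancestry)."""
    return {p: path.count(p) / len(path) for p in set(path)}


# Toy input: window 2 weakly favors "B", but the switch penalty keeps it "A".
toy = [
    {"A": -1.0, "B": -5.0},
    {"A": -1.0, "B": -5.0},
    {"A": -3.0, "B": -2.5},
    {"A": -1.0, "B": -5.0},
    {"A": -5.0, "B": -1.0},
    {"A": -5.0, "B": -1.0},
    {"A": -5.0, "B": -1.0},
]
path = viterbi_local_ancestry(toy)
print(path, change_points(path), global_proportions(path))
```

In this toy setup the switch penalty smooths over a single noisy window, so local ancestry comes out as contiguous segments, and the global estimate is simply the share of windows assigned to each population.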