Extensive sequencing of seven human genomes to characterize benchmark reference materials
Author: | William Stedman, Han Cao, Michael Saghbini, Jason Bobe, Alex Hastie, David Catoe, Arend Sidow, Marc L. Salit, Kristina Giorda, Alexander Wait Zaranek, Stephen T. Sherry, Zeljko Dzakula, Gintaras Deikus, R Truty, Erich Jaeger, Alexa B. R. McIntyre, Karoline Bjarnesdatter Rypdal, Christopher C. Chang, Robert Sebra, Srinka Ghosh, Grace X.Y. Zheng, Jonathan Trow, Yuling Liu, Tiffany Y. Liang, Khoa Pham, Fiona Hyland, Heather Ordonez, Dhruva Chandramohan, Noah Spies, Ziming Weng, Sofia Kyriazopoulou-Panagiotopoulou, Yutao Fu, Eric E. Schadt, Lindsay K. Vang, Ali Bashir, Madeleine Ball, Christopher E. Mason, Preston W. Estep, Keyan Zhao, George M. Church, Justin M. Zook, Ying Sheng, Mark Chaisson, Patrick Marks, Natali Gulbahce, Elizabeth Henaff, Patrice A Mudivarti, Feng Chen, Jennifer McDaniel, Michael Schnall-Levin, Chunlin Xiao, Noah Alexander, Ali Moshrefi |
---|---|
Language: | English |
Year of publication: | 2016 |
Subject: |
0301 basic medicine
Statistics and Probability; Data Descriptor; Standards; Genotype; Sequence assembly; Genomics; Computational biology; Biology; Library and Information Sciences; Polymorphism, Single Nucleotide; Genome; DNA sequencing; Education; 03 medical and health sciences; 0302 clinical medicine; INDEL Mutation; Genome assembly algorithms; Humans; Exome; Internet; Genome, Human; Sequence Analysis, DNA; Computer Science Applications; Personal Genome Project; Benchmarking; 030104 developmental biology; Genetic Techniques; Next-generation sequencing; Human genome; Nanopore sequencing; Statistics, Probability and Uncertainty; Software; 030217 neurology & neurosurgery; Research Article; Information Systems |
Source: | Scientific Data; BMC Genomics |
ISSN: | 2052-4463 |
DOI: | 10.1038/sdata.2016.25 |
Description: | Background The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). High-quality benchmark small variant calls for the pilot National Institute of Standards and Technology (NIST) Reference Material (NA12878) have been developed by the Genome in a Bottle Consortium, but no similar high-quality benchmark SV calls exist for this genome. Since SV callers output highly discordant results, we developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (svclassify) calculates annotations from one or more aligned BAM files from many high-throughput sequencing technologies, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives. Results We first used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions. We then used svclassify to cluster and classify these deletions as well as a set of high-confidence deletions from the 1000 Genomes Project and a set of breakpoint-resolved complex insertions from Spiral Genetics. We find that likely SVs cluster separately from likely non-SVs based on our annotations, and that the SVs cluster into different types of deletions. We then developed a supervised one-class classification method that uses a training set of random non-SV regions to determine whether candidate SVs have abnormal annotations different from most of the genome. To test this classification method, we use our pedigree-based breakpoint-resolved SVs, SVs validated by the 1000 Genomes Project, and assembly-based breakpoint-resolved insertions, along with semi-automated visualization using svviz.
Conclusions We find that candidate SVs with high scores from multiple technologies have high concordance with PCR validation and an orthogonal consensus method, MetaSV (99.7% concordant), and candidate SVs with low scores are questionable. We distribute a set of 2676 high-confidence deletions and 68 high-confidence insertions with high svclassify scores from these call sets for benchmarking SV callers. We expect these methods to be particularly useful for establishing high-confidence SV calls for benchmark samples that have been characterized by multiple technologies. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2366-2) contains supplementary material, which is available to authorized users. |
Database: | OpenAIRE |
External link: |