Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software.

Authors: LaCava MEF; Program in Ecology, University of Wyoming, Laramie, WY, USA.; Wildlife Genomics and Disease Ecology Laboratory, Department of Veterinary Sciences, University of Wyoming, Laramie, WY, USA., Aikens EO; Program in Ecology, University of Wyoming, Laramie, WY, USA.; Wyoming Cooperative Fish and Wildlife Research Unit, Department of Zoology and Physiology, University of Wyoming, Laramie, WY, USA., Megna LC; Program in Ecology, University of Wyoming, Laramie, WY, USA.; Department of Zoology and Physiology, University of Wyoming, Laramie, WY, USA., Randolph G; Genome Technologies Lab, University of Wyoming, Laramie, WY, USA., Hubbard C; Program in Ecology, University of Wyoming, Laramie, WY, USA.; Department of Botany, University of Wyoming, Laramie, WY, USA., Buerkle CA; Program in Ecology, University of Wyoming, Laramie, WY, USA.; Department of Botany, University of Wyoming, Laramie, WY, USA.
Language: English
Source: Molecular ecology resources [Mol Ecol Resour] 2020 Mar; Vol. 20 (2), pp. 360-370. Date of Electronic Publication: 2019 Nov 25.
DOI: 10.1111/1755-0998.13108
Abstract: Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
(© 2019 John Wiley & Sons Ltd.)
Database: MEDLINE