EDAR: an efficient error detection and removal algorithm for next generation sequencing data

Authors:	Xiaohong Zhao, Lance E. Palmer, Cristian Mircean, Gayle M. Wittenberg, Randall Bolanos, Daniel Fasulo
Year of publication:	2010
Subject:	Genome; Computer science; Shotgun sequencing; Process (computing); Sequence assembly; Word error rate; Computational Biology; Sequence Analysis DNA; Substring; Reduction (complexity); Set (abstract data type); Computational Mathematics; Computational Theory and Mathematics; Modeling and Simulation; Genetics; Error detection and correction; Molecular Biology; Algorithm; Sequence Alignment; Algorithms
Source:	Journal of computational biology: a journal of computational molecular cell biology. 17(11)
ISSN:	1557-8666
Description:	Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors, and when the error rate is greater than 3% for simulated data sets, it performs better. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3a02ba526e3f582f4a7fe86484ce50a5 https://pubmed.ncbi.nlm.nih.gov/20973743
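
Illustrative note: the description above outlines a pipeline of k-mer counting, per-read classification of low-count regions, and splitting reads at suspected errors. The Python sketch below shows only the general idea and is not the EDAR implementation. As a loudly stated assumption, it replaces the variable-bandwidth mean-shift clustering and the error-location heuristic with a single fixed threshold (`min_count`), and all function names and parameter values are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across the whole read set (the k-count)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kcount_profile(read, counts, k):
    """Return the k-count of each k-mer in the read, ordered by start position."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


def split_at_low_count(read, counts, k, min_count):
    """Split a read into fragments covered only by well-supported k-mers.

    ASSUMPTION: a fixed threshold stands in for the mean-shift clustering
    and error-location heuristic described in the abstract.
    """
    profile = kcount_profile(read, counts, k)
    fragments = []
    start = None  # start index of the current run of well-supported k-mers
    for i, count in enumerate(profile):
        if count >= min_count:
            if start is None:
                start = i
        elif start is not None:
            # The last good k-mer starts at i - 1 and ends at i - 1 + k.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


if __name__ == "__main__":
    # Toy data: five copies of a clean read plus one read with a substitution.
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    counts = kmer_counts(reads, k=3)
    print(kcount_profile("ACGTTCGT", counts, k=3))
    print(split_at_low_count("ACGTTCGT", counts, k=3, min_count=2))
```

In this toy input the substituted base sits at index 4 of the last read, so every k-mer overlapping it is rare and the read is cut into two shorter fragments on either side of it. This also illustrates the drawback the description mentions: for short reads, the resulting fragments can become very short.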