HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints.
Authors: | Press WH (Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, wpress@cs.utexas.edu; Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712), Hawkins JA (Oden Institute of Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712; Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712; Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712), Jones SK Jr (Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712; Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712), Schaub JM (Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712; Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712), Finkelstein IJ (Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712; Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712) |
---|---|
Language: | English |
Source: | Proceedings of the National Academy of Sciences of the United States of America [Proc Natl Acad Sci U S A] 2020 Aug 04; Vol. 117 (31), pp. 18489-18496. Date of Electronic Publication: 2020 Jul 16. |
DOI: | 10.1073/pnas.2004821117 |
Abstract: | Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding. (Copyright © 2020 the Author(s). Published by PNAS.) |
Database: | MEDLINE |
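
The abstract mentions user-defined sequence constraints such as avoiding excess repeats and keeping windowed GC content from going too high or too low. As a rough illustration of what checking such constraints could look like, here is a minimal Python sketch. It is not the HEDGES encoder or decoder, and it is not taken from the authors' code: the function names, the window length of 12, the GC bounds of 0.35 to 0.65, and the maximum run length of 4 are all hypothetical values chosen for the example.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty string)."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_fraction(seq: str) -> float:
    """Fraction of bases in a non-empty sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(
    seq: str,
    window: int = 12,      # hypothetical window length
    gc_min: float = 0.35,  # hypothetical lower GC bound
    gc_max: float = 0.65,  # hypothetical upper GC bound
    max_run: int = 4,      # hypothetical longest allowed repeat
) -> bool:
    """Check a strand against illustrative repeat and windowed-GC limits."""
    if not seq:
        return True
    if max_homopolymer_run(seq) > max_run:
        return False
    # Slide a fixed-length window along the strand; a strand shorter than
    # the window is checked once as a whole.
    for start in range(max(1, len(seq) - window + 1)):
        frac = gc_fraction(seq[start:start + window])
        if not (gc_min <= frac <= gc_max):
            return False
    return True


if __name__ == "__main__":
    for strand in ("ACGTACGTACGTACGT", "AAAAACGTACGT", "ATATATATATAT"):
        print(strand, satisfies_constraints(strand))
```

A check like this only verifies a finished strand. According to the abstract, HEDGES incorporates such constraints into the code itself, which this sketch does not attempt to reproduce.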