Simplitigs as an efficient and scalable representation of de Bruijn graphs
Autor: | Michael H. Baym, Karel Břinda, Gregory Kucherov |
---|---|
Přispěvatelé: | Department of Biomedical Informatics, Harvard Medical School [Boston] (HMS), Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, Skolkovo Institute of Science and Technology [Moscow] (Skoltech) |
Jazyk: | angličtina |
Rok vydání: | 2021 |
Předmět: |
Simplitigs
Theoretical computer science lcsh:QH426-470 Generalization Computer science representation Computation [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] Method Storage Biology De Bruijn graph Scaling symbols.namesake 03 medical and health sciences 0302 clinical medicine Greedy algorithm Representation (mathematics) lcsh:QH301-705.5 de Bruijn graphs Pan-genomes 030304 developmental biology k-mers De Bruijn sequence 0303 health sciences Sequence de Bruijn graph representation Search engine indexing Sequence analysis Scalability Computational Biology Genomics Sequence Analysis DNA Graph lcsh:Genetics Index (publishing) lcsh:Biology (General) Data compression symbols Indexing [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] Algorithms Software 030217 neurology & neurosurgery |
Zdroj: | Genome Biology, Vol 22, Iss 1, Pp 1-24 (2021) Genome Biology Genome Biology, BioMed Central, 2021, 22 (96), ⟨10.1101/2020.01.12.903443⟩ |
ISSN: | 1465-6906 1474-760X |
DOI: | 10.1101/2020.01.12.903443⟩ |
Popis: | Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes. Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes. Availability: ProphAsm is written in C++ and is available under the MIT license from http://github.com/prophyle/prophasm. |
Databáze: | OpenAIRE |
Externí odkaz: |