Simplitigs as an efficient and scalable representation of de Bruijn graphs

Autor: Michael H. Baym, Karel Břinda, Gregory Kucherov
Přispěvatelé: Department of Biomedical Informatics, Harvard Medical School [Boston] (HMS), Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, Skolkovo Institute of Science and Technology [Moscow] (Skoltech)
Jazyk: angličtina
Rok vydání: 2021
Předmět:
Simplitigs
Theoretical computer science
lcsh:QH426-470
Generalization
Computer science
representation
Computation
[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]
Method
Storage
Biology
De Bruijn graph
Scaling
symbols.namesake
03 medical and health sciences
0302 clinical medicine
Greedy algorithm
Representation (mathematics)
lcsh:QH301-705.5
de Bruijn graphs
Pan-genomes
030304 developmental biology
k-mers
De Bruijn sequence
0303 health sciences
Sequence
de Bruijn graph representation
Search engine indexing
Sequence analysis
Scalability
Computational Biology
Genomics
Sequence Analysis
DNA

Graph
lcsh:Genetics
Index (publishing)
lcsh:Biology (General)
Data compression
symbols
Indexing
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
Algorithms
Software
030217 neurology & neurosurgery
Zdroj: Genome Biology, Vol 22, Iss 1, Pp 1-24 (2021)
Genome Biology
Genome Biology, BioMed Central, 2021, 22 (96), ⟨10.1101/2020.01.12.903443⟩
ISSN: 1465-6906
1474-760X
DOI: 10.1101/2020.01.12.903443⟩
Popis: Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Subsequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic data sets, such as sequencing reads, genomes, and pan-genomes. Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of the cumulative sequence length as well as of the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes. Availability: ProphAsm is written in C++ and is available under the MIT license from http://github.com/prophyle/prophasm.
Databáze: OpenAIRE