REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Authors: Daniel Gautheret, Camille Marchet, Rayan Chikhi, Mikaël Salson, Zamin Iqbal
Contributors: Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), European Bioinformatics Institute [Cambridge, UK], Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Algorithmes pour les séquences biologiques - Sequence Bioinformatics, Institut Pasteur [Paris] (IP), This work was supported by ANR Transipedia [ANR-18-CE45-0020 to C.M., D.G., M.S. and R.C.], and INCEPTION [PIA/ANR-16-CONV-0005 to R.C.], ANR-18-CE45-0020,Transipedia,Signatures transcriptionnelles pour une analyse RNA-seq globale(2018), ANR-16-CONV-0005,INCEPTION,Institut Convergences pour l'étude de l'Emergence des Pathologies au Travers des Individus et des populatiONs(2016), Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS)
Language: English
Year of publication: 2020
Subject:
Statistics and Probability
MESH: Sequence Analysis, DNA
Computer science
0206 medical engineering
MESH: Algorithms
02 engineering and technology
Biochemistry
De Bruijn graph
03 medical and health sciences
MESH: Software
0302 clinical medicine
Abundance (ecology)
MESH: Sequence Analysis, RNA
Humans
Molecular Biology
030304 developmental biology
De Bruijn sequence
0303 health sciences
Information retrieval
MESH: Humans
Sequence Analysis, RNA
Search engine indexing
Sequence Analysis, DNA
Graph
Computer Science Applications
Computational Mathematics
Genomic Variation Analysis
Computational Theory and Mathematics
Index (publishing)
k-mer
030220 oncology & carcinogenesis
Data mining
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
Scale (map)
Algorithms
Software
020602 bioinformatics
Source: Bioinformatics
Bioinformatics, Oxford University Press (OUP), 2020, 36 (Supplement_1), pp.i177-i185. ⟨10.1093/bioinformatics/btaa487⟩
ISSN: 1367-4803
1367-4811
Description: Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. REINDEER then constructs and indexes monotigs, which, in a nutshell, are groups of k-mers of similar abundances. Availability and implementation: https://github.com/kamimrcht/REINDEER. Supplementary information: Supplementary data are available at Bioinformatics online.
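Illustration (not part of the original record): the idea of recording k-mer abundances across several datasets and answering presence/abundance queries can be sketched with a toy in-memory index. This is a minimal sketch under stated assumptions, not REINDEER's implementation: it uses a plain dictionary instead of compacted de Bruijn graphs, it does not build monotigs, and all function names (`build_index`, `query`, `group_runs`) are hypothetical. The `group_runs` helper only hints at why grouping k-mers with similar abundances saves space, by collapsing consecutive k-mers of a query that share the same count vector.

```python
from collections import Counter

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def count_kmers(reads, k):
    """Count canonical k-mers in one dataset (a list of reads)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def build_index(datasets, k):
    """Toy index: canonical k-mer -> tuple of counts, one entry per dataset."""
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    all_kmers = set().union(*per_dataset)
    return {km: tuple(c[km] for c in per_dataset) for km in all_kmers}


def query(index, seq, k, n_datasets):
    """Count vector for every k-mer of seq; all zeros means absent everywhere."""
    absent = (0,) * n_datasets
    return [
        index.get(canonical(seq[i:i + k]), absent)
        for i in range(len(seq) - k + 1)
    ]


def group_runs(vectors):
    """Collapse consecutive identical count vectors into (vector, run_length)."""
    runs = []
    for vec in vectors:
        if runs and runs[-1][0] == vec:
            runs[-1] = (vec, runs[-1][1] + 1)
        else:
            runs.append((vec, 1))
    return runs


if __name__ == "__main__":
    # Two tiny hypothetical datasets, k = 3.
    datasets = [["ACGTA", "ACGTA"], ["CGTAC"]]
    index = build_index(datasets, k=3)
    hits = query(index, "ACGTA", k=3, n_datasets=len(datasets))
    print(group_runs(hits))
```

A dictionary keyed by k-mer does not scale to billions of k-mers; the sketch only shows the shape of the query (one count per dataset for each k-mer of the query sequence).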
Database: OpenAIRE