REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
Authors: | Daniel Gautheret, Camille Marchet, Rayan Chikhi, Mikaël Salson, Zamin Iqbal |
---|---|
Contributors: | Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), European Bioinformatics Institute [Cambridge, UK], Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Algorithmes pour les séquences biologiques - Sequence Bioinformatics, Institut Pasteur [Paris] (IP), Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS). Funding: This work was supported by ANR Transipedia [ANR-18-CE45-0020 to C.M., D.G., M.S. and R.C.] and INCEPTION [PIA/ANR-16-CONV-0005 to R.C.]. Grants: ANR-18-CE45-0020, Transipedia, Signatures transcriptionnelles pour une analyse RNA-seq globale (2018); ANR-16-CONV-0005, INCEPTION, Institut Convergences pour l'étude de l'Emergence des Pathologies au Travers des Individus et des populatiONs (2016) |
Language: | English |
Year of publication: | 2020 |
Subject: | Statistics and Probability; Computer Science; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; Biochemistry; Molecular Biology; Genomic Variation Analysis; De Bruijn graph; De Bruijn sequence; k-mer; Abundance (ecology); Information retrieval; Search engine indexing; Index (publishing); Data mining; Graph; Scale (map); Bioinformatics; Medical engineering; Engineering and technology; Medical and health sciences; Clinical medicine; Developmental biology; Oncology & carcinogenesis; MESH: Algorithms; MESH: Humans; MESH: Sequence Analysis, DNA; MESH: Sequence Analysis, RNA; MESH: Software; [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM] |
Source: | Bioinformatics, Oxford University Press (OUP), 2020, 36 (Supplement_1), pp. i177-i185. ⟨10.1093/bioinformatics/btaa487⟩ |
ISSN: | 1367-4803 1367-4811 |
Description: | Motivation: We present REINDEER, a novel computational method that indexes sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. It then constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability and implementation: https://github.com/kamimrcht/REINDEER. Supplementary information: Supplementary data are available at Bioinformatics online. |
Database: | OpenAIRE |
External link: |
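
To make the idea in the description concrete, the following is a minimal Python sketch of a k-mer abundance index over several datasets. It is not REINDEER's implementation: it uses plain dictionaries instead of compacted de Bruijn graphs, and it only imitates the notion of grouping k-mers by abundance by storing each distinct count vector once. All names (`canonical`, `count_kmers`, `build_index`, `query`) and the choice `K = 5` are assumptions made for illustration.

```python
from collections import Counter

K = 5  # illustrative k-mer length (assumption; practical indexes use a larger k)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=K):
    """Count canonical k-mers in one dataset, given as a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def build_index(datasets, k=K):
    """Build a toy index over a list of datasets.

    Returns (kmer_to_group, vectors): k-mers with identical count vectors
    share one group id, so each distinct vector is stored only once.
    """
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    all_kmers = set().union(*per_dataset)
    vector_to_group = {}
    kmer_to_group = {}
    vectors = []
    for kmer in sorted(all_kmers):
        # One count per dataset; a Counter yields 0 for an absent k-mer.
        vector = tuple(counts[kmer] for counts in per_dataset)
        group = vector_to_group.get(vector)
        if group is None:
            group = len(vectors)
            vector_to_group[vector] = group
            vectors.append(vector)
        kmer_to_group[kmer] = group
    return kmer_to_group, vectors


def query(index, kmer):
    """Return the per-dataset counts of a k-mer, or None if it was never seen."""
    kmer_to_group, vectors = index
    group = kmer_to_group.get(canonical(kmer))
    return None if group is None else vectors[group]


index = build_index([["ACGTAC"], ["ACGTA", "ACGTA"]])
print(query(index, "ACGTA"))
print(query(index, "AAAAA"))
```

In this sketch a query returns one count per dataset, and a `None` result or a zero entry answers the presence/absence question. Sharing one stored vector among k-mers with identical counts is only a rough stand-in for the grouping the description attributes to monotigs.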