REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Authors:	Daniel Gautheret, Camille Marchet, Rayan Chikhi, Mikaël Salson, Zamin Iqbal
Contributors:	Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS); European Bioinformatics Institute [Cambridge, UK]; Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); Algorithmes pour les séquences biologiques - Sequence Bioinformatics, Institut Pasteur [Paris] (IP), Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS). Funding: This work was supported by ANR Transipedia [ANR-18-CE45-0020 to C.M., D.G., M.S. and R.C.] and INCEPTION [PIA/ANR-16-CONV-0005 to R.C.]. Grants: ANR-18-CE45-0020, Transipedia, Signatures transcriptionnelles pour une analyse RNA-seq globale (2018); ANR-16-CONV-0005, INCEPTION, Institut Convergences pour l'étude de l'Emergence des Pathologies au Travers des Individus et des populatiONs (2016)
Language:	English
Publication year:	2020
Subject:	Statistics and Probability; MESH: Sequence Analysis, DNA; Computer science; 0206 medical engineering; MESH: Algorithms; 02 engineering and technology; Biochemistry; De Bruijn graph; 03 medical and health sciences; MESH: Software; 0302 clinical medicine; Abundance (ecology); MESH: Sequence Analysis, RNA; Molecular Biology; 030304 developmental biology; De Bruijn sequence; 0303 health sciences; Information retrieval; MESH: Humans; Search engine indexing; Graph; Computer Science Applications; Computational Mathematics; Genomic Variation Analysis; Computational Theory and Mathematics; Index (publishing); k-mer; 030220 oncology & carcinogenesis; Data mining; [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM]; Scale (map); 020602 bioinformatics
Source:	Bioinformatics, Oxford University Press (OUP), 2020, 36 (Supplement_1), pp. i177-i185. ⟨10.1093/bioinformatics/btaa487⟩
ISSN:	1367-4803 1367-4811
Description:	Motivation: In this work, we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which, in a nutshell, are groups of k-mers of similar abundances. Availability and implementation: https://github.com/kamimrcht/REINDEER. Supplementary information: Supplementary data are available at Bioinformatics online.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::b63ad761bc18c70c8a4842296427491c https://hal.science/hal-03413006/file/btaa487.pdf
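
Illustrative note: the description above mentions recording k-mer abundances across a collection of datasets and grouping k-mers of similar abundances. The Python sketch below is a toy illustration of that idea only, not REINDEER's implementation. It uses plain dictionaries instead of compacted de Bruijn graphs, groups k-mers with identical (rather than similar) count vectors, and ignores graph adjacency, so the groups are not monotigs. All function names (`canonical`, `build_index`, `group_by_count_vector`, `query`) are hypothetical.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers in one dataset (a list of read strings)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def build_index(datasets, k):
    """Map each canonical k-mer to a count vector with one entry per dataset."""
    n = len(datasets)
    index = defaultdict(lambda: [0] * n)
    for dataset_id, reads in enumerate(datasets):
        for kmer, count in count_kmers(reads, k).items():
            index[kmer][dataset_id] = count
    return dict(index)


def group_by_count_vector(index):
    """Toy stand-in for abundance grouping: k-mers sharing an identical
    count vector are stored together, so the vector is kept only once."""
    groups = defaultdict(list)
    for kmer, vector in index.items():
        groups[tuple(vector)].append(kmer)
    return {vector: sorted(kmers) for vector, kmers in groups.items()}


def query(index, sequence, k, n_datasets):
    """Return the count vector of every k-mer in `sequence`.
    Absent k-mers yield an all-zero vector (exact presence/absence)."""
    absent = [0] * n_datasets
    return [
        index.get(canonical(sequence[i:i + k]), absent)
        for i in range(len(sequence) - k + 1)
    ]


if __name__ == "__main__":
    # Two tiny made-up datasets; real inputs would be sequencing reads.
    datasets = [["AAAC"], ["AAAC", "AAAC"]]
    index = build_index(datasets, k=3)
    print(group_by_count_vector(index))
    print(query(index, "GTTT", k=3, n_datasets=len(datasets)))
```

Because k-mers are stored in canonical form, querying the reverse-complement strand ("GTTT" here) returns the same count vectors as querying "AAAC".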