Data structures based on k-mers for querying large collections of sequencing data sets

Authors:	Paul Medvedev, Simon J. Puglisi, Camille Marchet, Christina Boucher, Mikaël Salson, Rayan Chikhi
Contributors:	Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), University of Florida [Gainesville] (UF), Helsingin yliopisto = Helsingfors universitet = University of Helsinki, Pennsylvania State University (Penn State), Penn State System, ANR-18-CE45-0020,Transipedia,Signatures transcriptionnelles pour une analyse RNA-seq globale(2018), University of Helsinki, Université de Lille-Ecole Centrale de Lille-Centre National de la Recherche Scientifique (CNRS), Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Séquence, Structure et Fonction des ARN (SSFA), Département Biologie des Génomes (DBG), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Department of Computer Science, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, Bioinformatics
Language:	English
Publication year:	2021
Subject:	European Nucleotide Archive; READS; Sequencing data; THOUSANDS; Review; DATABASES; Biology; 03 medical and health sciences; 0302 clinical medicine; SEARCH; Genetics; Genetics (clinical); ComputingMilieux_MISCELLANEOUS; 030304 developmental biology; 0303 health sciences; ALIGNMENT-FREE; Information retrieval; DE-BRUIJN GRAPHS; 1184 Genetics developmental biology physiology; High-Throughput Nucleotide Sequencing; Reproducibility of Results; Petabyte; QUANTIFICATION; Data structure; 1182 Biochemistry cell and molecular biology; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; Algorithms; Software; 030217 neurology & neurosurgery; Intuition
Source:	Genome Research, Cold Spring Harbor Laboratory Press, 2021, 31 (1), pp.1-12. ⟨10.1101/gr.260604.119⟩
ISSN:	1088-9051, 1549-5469
Description:	High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::542a4c41e85202b42bdc05c9ab6d5725 https://hal.science/hal-03165261