Data structures based on k-mers for querying large collections of sequencing data sets

Authors: Paul Medvedev, Simon J. Puglisi, Camille Marchet, Christina Boucher, Mikaël Salson, Rayan Chikhi
Contributors: Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), University of Florida [Gainesville] (UF), Helsingin yliopisto = Helsingfors universitet = University of Helsinki, Pennsylvania State University (Penn State), Penn State System, ANR-18-CE45-0020,Transipedia,Signatures transcriptionnelles pour une analyse RNA-seq globale(2018), University of Helsinki, Université de Lille-Ecole Centrale de Lille-Centre National de la Recherche Scientifique (CNRS), Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Séquence, Structure et Fonction des ARN (SSFA), Département Biologie des Génomes (DBG), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut de Biologie Intégrative de la Cellule (I2BC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Department of Computer Science, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, Bioinformatics
Language: English
Year of publication: 2021
Subject:
Source: Genome Research
Genome Research, 2021, 31 (1), pp.1-12. ⟨10.1101/gr.260604.119⟩
Genome Research, Cold Spring Harbor Laboratory Press, 2021, 31 (1), pp.1-12. ⟨10.1101/gr.260604.119⟩
Genome Res
ISSN: 1088-9051
1549-5469
Description: High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, several computational approaches have been introduced in the last few years to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Database: OpenAIRE