GATB: a software toolbox for genome assembly and analysis

Authors: Drezen, Erwan; Rizk, Guillaume; Chikhi, Rayan; Deltel, Charles; Lemaitre, Claire; Peterlongo, Pierre; Lavenier, Dominique
Contributors: Scalable, Optimized and Parallel Algorithms for Genomics (GenScale); Inria Rennes – Bretagne Atlantique; Institut National de Recherche en Informatique et en Automatique (Inria); GESTION DES DONNÉES ET DE LA CONNAISSANCE (IRISA-D7); Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA); Université de Rennes (UR); Université de Rennes 1 (UR1); Institut National des Sciences Appliquées - Rennes (INSA Rennes); Institut National des Sciences Appliquées (INSA); Université de Bretagne Sud (UBS); École normale supérieure - Rennes (ENS Rennes); Télécom Bretagne; CentraleSupélec; Centre National de la Recherche Scientifique (CNRS); Dept. of Computer Science and Engineering, Pennsylvania State University (Penn State); Penn State System; ANR-12-EMMA-0019-01; ANR-12-EMMA-0019, GATB, Boite à outils 'Assemblage pour la Génomique' (2012); Lavenier, Dominique
Language: English
Year of publication: 2014
Subject:
Source: Bio-IT World Conference, Apr 2014, Boston, United States
Description: International audience; The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix array, Burrows-Wheeler transform, Bloom filter, etc.). Mapping tools, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible.

The overall efficiency of NGS software comes from a smart combination of how data are represented in computer memory and how they are processed by the available processing units of a processor. Developing such software is thus a real challenge, as it requires a broad range of skills, from high-level data structure and algorithm concepts to fine implementation details.

GATB toolbox

The GATB software toolbox aims to simplify the design of NGS algorithms. It offers a set of high-level, optimized building blocks to speed up the development of NGS tools related to genome assembly and/or genome analysis. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computer, small server) with a few GB of memory. Using the high-level C++ API, developers of NGS tools can rapidly build their own software based on state-of-the-art algorithms and data structures of the domain.

The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.

Genomic Software

Several tools targeting specific genomic analyses have been built on the GATB toolbox. Below is a short list of the tools currently available. Many other tools are under development.

Minia is a short-read assembler capable of assembling large and complex genomes into contigs on a desktop computer. The assembler produces contigs of similar length and accuracy compared to other assemblers. As an example, a Boa constrictor constrictor (1.6 Gbp) dataset (Illumina 2x120 bp reads, 125x coverage) from Assemblathon 2 can be processed in approximately 45 hours with 3 GB of memory on a standard computer (3.4 GHz 8-core processor) using a single core, yielding a contig N50 of 3.6 Kbp (prior to scaffolding and gap-filling).

Bloocoo is a k-mer spectrum-based read error corrector, designed to correct large datasets with a very low memory footprint. The correction procedure is similar to the Musket multistage approach. Bloocoo yields similar results while requiring far less memory: as an example, it can correct whole human genome re-sequencing reads at 70x coverage with less than 4 GB of memory.

DiscoSNP aims to discover single nucleotide polymorphisms (SNPs) from non-assembled reads. Applied to a mouse dataset (2.88 Gbp, 100 bp Illumina reads), DiscoSNP takes 34 hours and at most 4.5 GB of RAM. In the same spirit, the TakeABreak software discovers inversions from non-assembled reads. It directly finds particular patterns in the de Bruijn graph, and its performance is similar to that of DiscoSNP.
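
The description names the de Bruijn graph as the underlying data structure without showing what it looks like. As a rough illustration only, the following C++ sketch builds the node set of such a graph from a few reads and enumerates the successors of a node. It does not use the GATB API: the function names `collect_kmers` and `successors` are hypothetical, an exact hash set stands in for whatever compact structure the library actually uses, and reverse complements are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of GATB): collect every distinct k-mer
// found in the reads. In a de Bruijn graph these k-mers are the nodes.
std::unordered_set<std::string> collect_kmers(const std::vector<std::string>& reads,
                                              std::size_t k) {
    assert(k > 0);
    std::unordered_set<std::string> kmers;
    for (const std::string& read : reads) {
        // Reads shorter than k contribute no k-mer (the loop body never runs).
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            kmers.insert(read.substr(i, k));
        }
    }
    return kmers;
}

// Hypothetical helper (not part of GATB): edges are implicit. A node's
// successors are the k-mers obtained by dropping its first nucleotide and
// appending A, C, G or T, restricted to k-mers that are present in the set.
std::vector<std::string> successors(const std::unordered_set<std::string>& kmers,
                                    const std::string& node) {
    assert(!node.empty());
    const std::string bases = "ACGT";
    const std::string suffix = node.substr(1);
    std::vector<std::string> out;
    for (char base : bases) {
        const std::string candidate = suffix + base;
        if (kmers.count(candidate) != 0) {
            out.push_back(candidate);
        }
    }
    return out;
}
```

For the single read `ACGTA` with k = 3, the node set is {ACG, CGT, GTA}; `successors(kmers, "ACG")` returns only `CGT`, and `GTA` has no successor. Because edges are derived from membership queries alone, the graph needs no explicit edge storage, which is one reason a probabilistic membership structure such as the Bloom filter mentioned above could plausibly replace the hash set when memory is tight.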
Database: OpenAIRE