gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections
Authors: | Giovanna Rosone, Simon Gog, Nicola Prezza, Guilherme P. Telles, Felipe A. Louza |
---|---|
Year of publication: | 2020 |
Subject: |
Theoretical computer science
lcsh:QH426-470; String collections; Burrows–Wheeler transform; Document array; LCP array; Suffix array; Computer science; 0102 computer and information sciences; 01 natural sciences; law.invention; 03 medical and health sciences; Structural Biology; law; Data_FILES; lcsh:QH301-705.5; Molecular Biology; 030304 developmental biology; 0303 health sciences; Settore INF/01 - Informatica; Applied Mathematics; String (computer science); Search engine indexing; Construct (python library); Data structure; Software; Article; lcsh:Genetics; lcsh:Biology (General); Computational Theory and Mathematics; 010201 computation theory & mathematics; Suffix |
Source: | Algorithms for Molecular Biology (AMB), Vol 15, Iss 1, Pp 1-5 (2020) |
ISSN: | 1748-7188 |
Description: | Background: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix (LCP) array, the Burrows–Wheeler transform (BWT), and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Results: In this paper we introduce gsufsort, an open-source tool for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22–39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections. |
Database: | OpenAIRE |
External link: |
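
To make the four structures named in the description concrete, here is a minimal Python sketch that builds the suffix array, LCP array, BWT and document array for a small string collection. It is an illustration only: it uses a naive comparison sort rather than gSACA-K, so it does not run in O(N) time, and it is not gsufsort's code or interface. The function name `build_indexes`, the `$` terminator, and the convention that the terminator of string i sorts before the terminator of string j when i < j are all assumptions made for this example.

```python
def build_indexes(strings):
    """Naively build SA, LCP, BWT and document array for a string collection.

    Hypothetical helper for illustration; quadratic or worse, unlike gSACA-K.
    """
    keys, symbols, doc = [], [], []
    for i, s in enumerate(strings):
        for c in s:
            keys.append((1, ord(c)))   # ordinary symbol: sorts after any terminator
            symbols.append(c)
            doc.append(i)
        keys.append((0, i))            # terminator of string i: unique, smallest class
        symbols.append("$")
        doc.append(i)

    n = len(keys)

    # Suffix array: starting positions of all suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda p: keys[p:])

    # LCP array: longest common prefix of each suffix with its predecessor in SA.
    # Terminators are unique, so a match can never extend across a string boundary.
    lcp = [0] * n
    for k in range(1, n):
        a, b = sa[k - 1], sa[k]
        h = 0
        while a + h < n and b + h < n and keys[a + h] == keys[b + h]:
            h += 1
        lcp[k] = h

    # BWT: the symbol preceding each sorted suffix (wrapping around at position 0).
    bwt = "".join(symbols[p - 1] for p in sa)

    # Document array: which input string each sorted suffix starts in.
    da = [doc[p] for p in sa]

    return sa, lcp, bwt, da


if __name__ == "__main__":
    sa, lcp, bwt, da = build_indexes(["ab", "b"])
    print("SA: ", sa)
    print("LCP:", lcp)
    print("BWT:", bwt)
    print("DA: ", da)
```

The document array is what distinguishes the collection case from the single-string case: it lets a query map any suffix array position back to the input string it came from.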