Suffix trees for inputs larger than main memory

Authors:	Alex Thomo, Ulrike Stege, Marina Barsky
Year of publication:	2011
Subject:	Compressed suffix array; Theoretical computer science; Computer science; Suffix tree; String (computer science); Generalized suffix tree; String searching algorithm; Longest common substring problem; law.invention; Hardware and Architecture; law; Data_FILES; Suffix; Software; FM-index; Information Systems
Source:	Information Systems. 36:644-654
ISSN:	0306-4379
DOI:	10.1016/j.is.2010.11.001
Description:	A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to using suffix trees in real-life applications, the current construction methods do not scale to large inputs. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first attempts at building large suffix trees focused on algorithms that avoid massive random access to the trees being built. However, all existing practical algorithms perform random access to the input string, and thus in essence require the input to be small enough to fit in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings. We are the first to present an algorithm that can construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to both of them. As a proof of concept, we show that our method makes it possible to build the suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the human genome, and the construction of suffix trees for inputs of such magnitude had never been reported before.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::6fb98ee5a45b39a202695256b12cc14a https://doi.org/10.1016/j.is.2010.11.001