Fast Label Extraction in the CDAWG

Autor: Fabio Cunial, Djamal Belazzougui
Rok vydání: 2017
Předmět:
Zdroj: String Processing and Information Retrieval ISBN: 9783319674278
SPIRE
DOI: 10.1007/978-3-319-67428-5_14
Popis: The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from \(O(m\log {\log {n}})\) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from \(O(m\log {\log {n}}+\mathtt {occ})\) to \(O(m+\mathtt {occ})\) in the time needed to locate all the \(\mathtt {occ}\) occurrences of the pattern. We also reduce from \(O(k\log {\log {n}})\) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from \(O(m\log {\log {n}})\) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
Databáze: OpenAIRE