Various improvements to text fingerprinting

Autor: Roman Kolpakov, Djamal Belazzougui, Mathieu Raffinot
Rok vydání: 2013
Předmět:
Zdroj: Journal of Discrete Algorithms. 22:1-18
ISSN: 1570-8667
DOI: 10.1016/j.jda.2013.06.004
Popis: Let s=s"1..s"n be a text (or sequence) on a finite alphabet @S of size @s. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s"i..s"j is a maximal location for a fingerprint [email protected]?F (denoted by ) if the alphabet of s"i..s"j is f and s"i"-"1, s"j"+"1, if defined, are not in f. The set of maximal locations in s is L (it is easy to see that |L|= and such that s"i..s"j=s"k..s"l are named copies, and the quotient set of L according to the copy relation is denoted by L"C. We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters in @S, to answer if f represents a fingerprint in F; (3) given f, to find all maximal locations of f in s. As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in O(n+|L"C|[email protected]) worst-case time (in this paper all logarithms are intended as base two logarithms) using O((n+|L"C|+|F|[email protected])logn) bits of space, or in O(n+|L|[email protected]) randomized expected time using O((n+|F|[email protected])logn) bits of space. Problem 2 is solved either in O(|f|) expected time if only O(|f|logn) bits of working space for queries is allowed, or in worst-case O(|f|/@e) time if a working space of O(@s^@elogn) bits is allowed (with @e a constant satisfying 0
Databáze: OpenAIRE