Various improvements to text fingerprinting
Autor: | Roman Kolpakov, Djamal Belazzougui, Mathieu Raffinot |
---|---|
Rok vydání: | 2013 |
Předmět: |
FOS: Computer and information sciences
Sequence Discrete Mathematics (cs.DM) 010102 general mathematics 0102 computer and information sciences Base (topology) 01 natural sciences Substring Computer Science - Information Retrieval Theoretical Computer Science Combinatorics Succinct data structure Computational Theory and Mathematics 010201 computation theory & mathematics Bounded function Computer Science - Data Structures and Algorithms Discrete Mathematics and Combinatorics Order (group theory) Data Structures and Algorithms (cs.DS) 0101 mathematics Time complexity Information Retrieval (cs.IR) Quotient Computer Science - Discrete Mathematics Mathematics |
Zdroj: | Journal of Discrete Algorithms. 22:1-18 |
ISSN: | 1570-8667 |
DOI: | 10.1016/j.jda.2013.06.004 |
Popis: | Let s=s"1..s"n be a text (or sequence) on a finite alphabet @S of size @s. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s"i..s"j is a maximal location for a fingerprint [email protected]?F (denoted by ) if the alphabet of s"i..s"j is f and s"i"-"1, s"j"+"1, if defined, are not in f. The set of maximal locations in s is L (it is easy to see that |L|= and such that s"i..s"j=s"k..s"l are named copies, and the quotient set of L according to the copy relation is denoted by L"C. We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters in @S, to answer if f represents a fingerprint in F; (3) given f, to find all maximal locations of f in s. As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in O(n+|L"C|[email protected]) worst-case time (in this paper all logarithms are intended as base two logarithms) using O((n+|L"C|+|F|[email protected])logn) bits of space, or in O(n+|L|[email protected]) randomized expected time using O((n+|F|[email protected])logn) bits of space. Problem 2 is solved either in O(|f|) expected time if only O(|f|logn) bits of working space for queries is allowed, or in worst-case O(|f|/@e) time if a working space of O(@s^@elogn) bits is allowed (with @e a constant satisfying 0 |
Databáze: | OpenAIRE |
Externí odkaz: |