Various improvements to text fingerprinting

Autor:	Roman Kolpakov, Djamal Belazzougui, Mathieu Raffinot
Rok vydání:	2013
Předmět:	FOS: Computer and information sciences Sequence Discrete Mathematics (cs.DM) 010102 general mathematics 0102 computer and information sciences Base (topology) 01 natural sciences Substring Computer Science - Information Retrieval Theoretical Computer Science Combinatorics Succinct data structure Computational Theory and Mathematics 010201 computation theory & mathematics Bounded function Computer Science - Data Structures and Algorithms Discrete Mathematics and Combinatorics Order (group theory) Data Structures and Algorithms (cs.DS) 0101 mathematics Time complexity Information Retrieval (cs.IR) Quotient Computer Science - Discrete Mathematics Mathematics
Zdroj:	Journal of Discrete Algorithms. 22:1-18
ISSN:	1570-8667
DOI:	10.1016/j.jda.2013.06.004
Popis:	Let s=s"1..s"n be a text (or sequence) on a finite alphabet @S of size @s. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set F of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s"i..s"j is a maximal location for a fingerprint [email protected]?F (denoted by ) if the alphabet of s"i..s"j is f and s"i"-"1, s"j"+"1, if defined, are not in f. The set of maximal locations in s is L (it is easy to see that \|L\|= and such that s"i..s"j=s"k..s"l are named copies, and the quotient set of L according to the copy relation is denoted by L"C. We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute F; (2) given f as a set of distinct characters in @S, to answer if f represents a fingerprint in F; (3) given f, to find all maximal locations of f in s. As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in O(n+\|L"C\|[email protected]) worst-case time (in this paper all logarithms are intended as base two logarithms) using O((n+\|L"C\|+\|F\|[email protected])logn) bits of space, or in O(n+\|L\|[email protected]) randomized expected time using O((n+\|F\|[email protected])logn) bits of space. Problem 2 is solved either in O(\|f\|) expected time if only O(\|f\|logn) bits of working space for queries is allowed, or in worst-case O(\|f\|/@e) time if a working space of O(@s^@elogn) bits is allowed (with @e a constant satisfying 0
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::1e5275d0908edaf109f5e1360f56060b https://doi.org/10.1016/j.jda.2013.06.004 Zobrazit plný text záznamu Full Text from ScienceDirect