Beating CountSketch for Heavy Hitters in Insertion Streams
Autor: | Stephen R. Chestnut, Vladimir Braverman, David P. Woodruff, Nikita Ivkin |
---|---|
Rok vydání: | 2015 |
Předmět: |
FOS: Computer and information sciences
Data stream mining 0102 computer and information sciences 02 engineering and technology Data structure Space (mathematics) 01 natural sciences Upper and lower bounds Combinatorics symbols.namesake 010201 computation theory & mathematics 020204 information systems Norm (mathematics) Computer Science - Data Structures and Algorithms 0202 electrical engineering electronic engineering information engineering symbols Data Structures and Algorithms (cs.DS) Without loss of generality Constant (mathematics) Gaussian process Algorithm Mathematics |
Zdroj: | STOC |
DOI: | 10.48550/arxiv.1511.00661 |
Popis: | Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n\log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014). (2) A way to estimate the $\ell_{\infty}$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n\log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams. |
Databáze: | OpenAIRE |
Externí odkaz: |