Statistical properties of sketching algorithms
Autor: | William J. Astle, Sylvia Richardson, Daniel Ahfock |
---|---|
Přispěvatelé: | Astle, William [0000-0001-8866-6672], Richardson, Sylvia [0000-0003-1998-492X], Apollo - University of Cambridge Repository |
Jazyk: | angličtina |
Rok vydání: | 2017 |
Předmět: |
Statistics and Probability
FOS: Computer and information sciences Sketching Mean squared error General Mathematics Gaussian Inference 010103 numerical & computational mathematics 01 natural sciences Statistics - Computation Article Methodology (stat.ME) 010104 statistics & probability symbols.namesake Hadamard transform 0101 mathematics Computation (stat.CO) Statistics - Methodology Mathematics Central limit theorem Applied Mathematics Random projection Probabilistic logic Estimator Agricultural and Biological Sciences (miscellaneous) Computational efficiency symbols Randomized numerical linear algebra Statistics Probability and Uncertainty General Agricultural and Biological Sciences Algorithm Data compression |
Zdroj: | Biometrika |
Popis: | Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson-Woodruff sketches, and their use in single pass sketching algorithms for linear regression with huge $n$. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketched estimators. A key result is a conditional central limit theorem for data oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean square error is related to the signal to noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two real datasets. added central limit theorem under weaker conditions, unconditional results, corrected expression for variance of partial sketch |
Databáze: | OpenAIRE |
Externí odkaz: |