DROP

Autor: Sahaana Suri, Peter Bailis
Rok vydání: 2019
Předmět:
Zdroj: DEEM@SIGMOD
Popis: Dimensionality reduction (DR) is critical in scaling machine learning pipelines: by reducing input dimensionality in exchange for a preprocessing overhead, DR enables faster end-to-end runtime. Principal component analysis (PCA) is a DR standard, but can be computationally expensive: classically O(dn2 + n3) for an n-dimensional dataset of d points. Theoretical work has optimized PCA via iterative, sample-based stochastic methods. However, these methods execute for a fixed number of iterations or to convergence, sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after processing small (e.g., 1%) samples of data. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over Singular-Value-Decomposition (SVD)-based PCA, and 16x over conventional DR methods in end-to-end nearest neighbor workloads.
Databáze: OpenAIRE