The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics
Authors: | Yiqun Zhang, Wellington Cabrera, Carlos Ordonez |
---|---|
Publication year: | 2016 |
Subject: | Computer science; Statistical model; Feature selection; 02 engineering and technology; Automatic summarization; Matrix multiplication; Computer Science Applications; Computational science; Data modeling; Data set; Matrix (mathematics); Computational Theory and Mathematics; 020204 information systems; Principal component analysis; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Algorithm; Information Systems; Sparse matrix |
Source: | IEEE Transactions on Knowledge and Data Engineering. 28:1905-1918 |
ISSN: | 1041-4347 |
DOI: | 10.1109/tkde.2016.2545664 |
Description: | Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. Array DBMSs, in turn, enable scalable computation with large matrices. With that motivation in mind, we propose a parallel array operator, based on a specific form of matrix multiplication, that computes a comprehensive data summarization matrix. By deriving equivalent equations based on the summarization matrix, statistical methods are adapted to work in two phases: (1) parallel summarization of the data set in one pass; (2) iteration exploiting the summarization matrix in many intermediate computations. We prove that our summarization matrix captures essential statistical properties of the data set and that it allows iterative algorithms to work faster in main memory, by decreasing the number of times the data set is scanned and by reducing the number of CPU operations. Specifically, we show that our summarization matrix benefits statistical models, including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix on the SciDB parallel array DBMS and how to exploit it in the R language statistical system. To achieve the best performance, we introduce two specialized array operators for dense and sparse data sets, respectively. We present an experimental evaluation comparing SciDB, R, a columnar DBMS (a fast SQL engine), and Spark (a popular Hadoop system). Our experiments show that R working together with SciDB removes the main-memory and performance limitations of R. More importantly, our R+SciDB prototype is significantly faster and more scalable than Spark and the columnar DBMS. |
Database: | OpenAIRE |
External link: |
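
The two-phase idea in the description can be illustrated with a small sketch. The record does not spell out the summarization matrix, so this assumes it is the cross-product sum of augmented points z = [1, x_1, ..., x_d, y], which holds the row count, the linear sums, and the quadratic sums. All function names below are hypothetical, and plain Python stands in for the SciDB array operators and R code the paper actually uses.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def gamma_partial(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> Matrix:
    """Phase 1: one pass over a chunk, accumulating z z^T with z = [1, x, y]."""
    size = len(xs[0]) + 2
    g = [[0.0] * size for _ in range(size)]
    for x, y in zip(xs, ys):
        z = [1.0, *x, y]
        for a in range(size):
            for b in range(size):
                g[a][b] += z[a] * z[b]
    return g

def gamma_merge(g1: Matrix, g2: Matrix) -> Matrix:
    """Partial summaries from parallel workers combine by element-wise addition."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(g1, g2)]

def mean_and_variance(g: Matrix) -> Tuple[List[float], List[float]]:
    """Phase 2: per-dimension mean and population variance, no data scan needed."""
    n, d = g[0][0], len(g) - 2
    mean = [g[0][j] / n for j in range(1, d + 1)]
    var = [g[j][j] / n - mean[j - 1] ** 2 for j in range(1, d + 1)]
    return mean, var

def solve(a: Matrix, b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def linear_regression(g: Matrix) -> List[float]:
    """Phase 2: least-squares coefficients [intercept, slopes...] from the summary.

    The top-left (d+1)x(d+1) block holds the sums over [1, x][1, x]^T and the
    last column holds the sums over [1, x] * y, i.e. the normal equations.
    """
    d = len(g) - 2
    a = [row[: d + 1] for row in g[: d + 1]]
    b = [g[i][d + 1] for i in range(d + 1)]
    return solve(a, b)

# Toy data set (made up for illustration): y = 2x + 1, split into two chunks.
xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [3.0, 5.0, 7.0, 9.0]
gamma = gamma_merge(gamma_partial(xs[:2], ys[:2]), gamma_partial(xs[2:], ys[2:]))
mean, var = mean_and_variance(gamma)
beta = linear_regression(gamma)
```

Because the summary has (d+2)×(d+2) entries regardless of the number of rows, the statistics in phase 2 are computed without rescanning the data, which is the speedup the description refers to.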