The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics

Authors: Yiqun Zhang, Wellington Cabrera, Carlos Ordonez
Publication year: 2016
Subject:
Source: IEEE Transactions on Knowledge and Data Engineering. 28:1905-1918
ISSN: 1041-4347
DOI: 10.1109/tkde.2016.2545664
Description: Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. Array DBMSs, in turn, enable scalable computation with large matrices. With that motivation in mind, we propose a parallel array operator, based on a specific form of matrix multiplication, that computes a comprehensive data summarization matrix. By deriving equivalent equations based on the summarization matrix, statistical methods are adapted to work in two phases: (1) parallel summarization of the data set in one pass; (2) iteration that exploits the summarization matrix in many intermediate computations. We prove that our summarization matrix captures essential statistical properties of the data set and that it allows iterative algorithms to work faster in main memory, both by decreasing the number of times the data set is scanned and by reducing the number of CPU operations. Specifically, we show that our summarization matrix benefits statistical models, including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix on the SciDB parallel array DBMS and how to exploit it in the R language statistical system. To achieve the best performance, we introduce two specialized array operators, one for dense and one for sparse data sets. We present an experimental evaluation comparing SciDB, R, a columnar DBMS (a fast SQL engine), and Spark (a popular Hadoop system). Our experiments show that R working together with SciDB removes R's main-memory and performance limitations. More importantly, our R+SciDB prototype is significantly faster and more scalable than Spark and the columnar DBMS.
Database: OpenAIRE
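
The two-phase scheme in the description can be illustrated with a minimal NumPy sketch. This record does not give the definition of the summarization matrix, so the sketch assumes it is the product Z^T Z of the augmented matrix Z = [1, X, y], accumulated chunk by chunk in a single pass. The function names (`gamma_chunked`, `linear_regression_from_gamma`, `covariance_from_gamma`) and the chunking strategy are hypothetical; they are not the SciDB operators or the R code from the paper.

```python
import numpy as np


def gamma_chunked(X, y, chunk_size=1000):
    """Phase 1: accumulate Gamma = Z^T Z in one pass, with Z = [1, X, y].

    Assumption: this block form stands in for the paper's summarization
    matrix. Each chunk contributes an independent partial sum, so the loop
    could be distributed across workers and the partial results added.
    """
    n, d = X.shape
    gamma = np.zeros((d + 2, d + 2))
    for start in range(0, n, chunk_size):
        Xc = X[start:start + chunk_size]
        yc = y[start:start + chunk_size]
        # Augment the chunk with a column of ones and the target column.
        Z = np.column_stack([np.ones(len(Xc)), Xc, yc])
        gamma += Z.T @ Z
    return gamma


def linear_regression_from_gamma(gamma):
    """Phase 2: solve the normal equations using only Gamma (no data scan).

    The upper-left (d+1)x(d+1) block is [1, X]^T [1, X]; the last column
    without its final entry is [1, X]^T y. Returns [intercept, coefficients].
    """
    A = gamma[:-1, :-1]
    b = gamma[:-1, -1]
    return np.linalg.solve(A, b)


def covariance_from_gamma(gamma):
    """Phase 2: population covariance of X from Gamma, e.g. as input to PCA."""
    n = gamma[0, 0]          # number of rows
    L = gamma[1:-1, 0]       # per-column sums of X
    Q = gamma[1:-1, 1:-1]    # X^T X (sums of cross-products)
    return Q / n - np.outer(L, L) / n**2
```

In this sketch the data set is read once, in `gamma_chunked`; the regression coefficients and the covariance matrix are then obtained from a (d+2)x(d+2) matrix that fits in main memory regardless of the number of rows.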