Popis: |
Entering the big data era, wide table, which contains thousands of columns, is being widely adopted in cloud systems for its facilitation to critical fields such as warehouse/log analysis system, scientific applications and RDF storage. Although taking the advantage of avoiding expensive join in distributed environment, wide table puts an urgent demand for fast scan during data processing. However, represented by horizontal row-store, vertical column store and hybrid columnar store, existing promising data placement structures are witnessing a massive waste of computing resources on disk seeks, due to their ignorance of column order in physical data placement. To make the matter worse, some superior placement methods require additional adjustment to query processing engine or violate the data replication strategy and thus changed error-prone and incur additional disk seeks. In this paper, we explore a Workload Aware Column Order solution, WACO, to boost scan operator in wide table. The new WACO solution maximizes the sequential disk access and turns out to be transparent to underlying cloud systems, which therefore does not exhibit any above-mentioned shortcomings. In particular, acquire query workload, we exploit the recent access patterns on wide table and their frequencies via query logs. Given access patterns and their workload, we investigate the column placement strategy and proof that it belongs to NP-hard to figure out the optimal column layout. Furthermore, we propose a linear programming based solution to efficiently obtain an effective column order in internal physical placement. To make our solution robust and practical, we implement such scan-optimized data placement strategy as a library and thus it is a seamless integration with the underlying system and does not require any adjustment to existing system. We conduct extensive experiments on real-world TPC-H benchmark and SDSS dataset for simulate wide table to demonstrate the superiority of our solution. The experiment results show that our approach is 2x faster than the state-of-the-art. |