Instance-Optimized Data Layouts for Cloud Analytics Workloads
Autor: | Yinan Li, Johannes Gehrke, Tim Kraska, Jialin Ding, Badrish Chandramouli, Ying Li, Chi Wang, Donald Kossmann, Umar Farooq Minhas |
---|---|
Rok vydání: | 2021 |
Předmět: |
Database
Computer science business.industry Joins 020207 software engineering Cloud computing 02 engineering and technology computer.software_genre Blocking (computing) Analytics 0202 electrical engineering electronic engineering information engineering Table (database) business Cloud storage computer Snowflake schema Block (data storage) |
Zdroj: | SIGMOD Conference |
Popis: | Today, businesses rely on efficiently running analytics on large amounts of operational and historical data to gain business insights and competitive advantage. Increasingly, such analytics are run using cloud-based data analytics services, such as Google BigQuery, Microsoft Azure Synapse, Amazon Redshift, and Snowflake. These services persist and process data in compressed, columnar formats, stored in large blocks, each of which contains thousands or millions of records. For these services, disk I/O from (remote) cloud storage is often one of the dominant costs for query processing. To reduce the amount of I/O, services often maintain per-block metadata, such as zone maps, which are used to skip blocks that are irrelevant to the query, leading to lower query execution times. However, the effectiveness of block skipping via zone maps is dependent on how the records are assigned to blocks. Recent work on instance-optimized data layouts aims to maximize block skipping by specializing the block assignment strategy to a specific dataset and workload. However, these existing approaches only optimize the layout for a single table. In this paper, we propose MTO, an instance-optimized data layout framework that determines the blocking strategy for all tables in a multi-table database in the presence of joins, such as in a star or snowflake schema common in real-world workloads. MTO takes advantage of sideways information passing through joins to jointly optimize the layout for all tables, which results in better block skipping and hence reduced query execution times. Experiments on a commercial cloud-based analytics service show that MTO achieves up to 93% reduction in blocks accessed and 75% reduction in end-to-end query times compared to alternative blocking strategies. |
Databáze: | OpenAIRE |
Externí odkaz: |