Caching Cost Model for In-memory Data Analytics Framework

Authors: Hwansoo Han, Seongsoo Park, Minseop Jeong
Publication year: 2020
Subject:
Source: SMA
DOI: 10.1145/3426020.3426070
Description: In the era of data-parallel analytics, caching intermediate results is a key method for speeding up the framework. Existing frameworks apply various caching policies depending on the run-time context or the programmer’s decision. Because caching still leaves room for optimization, more sophisticated caching that accounts for its performance benefit is required. However, existing frameworks are limited in their ability to measure the performance benefit from caching because they measure computing time only at the distributed task level. In this paper, we propose an operator-level computing time metric and a cost model that predicts the performance benefit from caching for in-memory data analytics frameworks. We implemented our scheme in Apache Spark and evaluated its prediction accuracy with Spark benchmark programs. The average error of the cost model, measured with 10x input data, was 7.3%, and the performance benefit predicted by the model was within 24% of the actual performance benefit. The proposed cost model and performance benefit prediction method can be used to decide on and optimize caching in data analytics engines so as to maximize the performance benefit.
Database: OpenAIRE