From performance profiling to predictive analytics while evaluating Hadoop cost-efficiency in ALOJA

Authors: David Carrera, Nikola Vujic, Fabrizio Gagliardi, Aaron Call, Robert L. Reinauer, Daron Green, José A. Blakeley, Josep Ll. Berral, Nicolas Poggi
Year of publication: 2015
Subject:
Source: IEEE BigData
DOI: 10.1109/bigdata.2015.7363876
Description: In recent years, the exponential growth of data, its generation speed, and its expected consumption rate have become one of the most important challenges in IT, for both industry and research. For these reasons, BSC and Microsoft created the ALOJA research project as an open initiative to increase the cost-efficiency and general understanding of Big Data systems via automation and learning. Over its first year, the project has produced an open-source benchmarking platform, which has been used to build the largest public repository of Big Data results, featuring over 42,000 job execution details. ALOJA also includes web-based analytic tools to evaluate and gather insights about the cost-performance of benchmarked systems. The tools offer means to extract knowledge that can help optimize configuration and deployment options in the Cloud, e.g., selecting the most cost-effective VMs and cluster sizes. This article describes the evolution of the project's focus and research lines over more than a year of continuously benchmarking Big Data systems, and discusses the motivation, both technical and market-based, for these changes. It also presents the main results from the evaluation of different OS and Hadoop configurations, covering over 100 hardware deployments. During this time, ALOJA's focus has shifted from low-level profiling of the Hadoop runtime with HPC tools, through extensive benchmarking and evaluation of a large body of results via aggregation, to its current use of Predictive Analytics (PA) techniques. The ongoing PA efforts show promising results for automatically modeling the behavior of systems, e.g., predicting job execution times with high accuracy or reducing the number of benchmark runs needed, as well as for Knowledge Discovery (KD) to find relations among software and hardware components. Together, these techniques support forecasting the cost-effectiveness of newly defined systems, reducing benchmarking time and costs.
Database: OpenAIRE