Selecting resources for distributed dataflow systems according to runtime targets
Authors: | Ilya Verbitskiy, Thomas Renner, Florian Schmidt, Odej Kao, Lauritz Thamsen |
---|---|
Year of publication: | 2016 |
Subject: | Dataflow; Computer science; Model selection; Distributed computing; 02 engineering and technology; Yarn; Data modeling; Set (abstract data type); 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Data analysis; 020201 artificial intelligence & image processing; Resource management (computing) |
Source: | IPCCC (IEEE International Performance Computing and Communications Conference) |
DOI: | 10.1109/pccc.2016.7820629 |
Abstract: | Distributed dataflow systems like Spark or Flink enable users to analyze large datasets. Users create programs by providing sequential user-defined functions for a set of well-defined operations, select a set of resources, and the systems automatically distribute the jobs across these resources. However, selecting resources for specific performance needs is inherently difficult, and users consequently tend to overprovision, which results in poor cluster utilization. At the same time, many important jobs are executed repeatedly in production clusters. This paper presents Bell, a practical system that monitors job execution, models the scale-out behavior of jobs based on previous runs, and selects resources according to user-provided runtime targets. Bell automatically chooses between different runtime prediction models to optimally support different distributed dataflow systems. Bell is implemented as a job submission tool for YARN and thus works with existing cluster setups. We evaluated Bell's runtime prediction with six exemplary data analytics jobs using both Spark and Flink. We present the learned scale-out models for these jobs and evaluate the relative prediction error using cross-validation, showing that our model selection approach provides better overall performance than the individual prediction models. |
Database: | OpenAIRE |
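The abstract outlines a loop: record runtimes of previous runs at different scale-outs, fit several runtime prediction models, keep the one with the lowest cross-validated relative error, and pick the smallest scale-out whose predicted runtime meets the user's target. The Python sketch below illustrates that loop under stated assumptions. The two candidate models (an Amdahl-style fit t(n) = a + b/n and piecewise-linear interpolation), the leave-one-out scheme, the run history, and every function name are hypothetical illustrations, not Bell's actual models, data, or interface.

```python
from statistics import mean

# Hypothetical history of previous runs: (scale-out in nodes, runtime in seconds).
RUNS = [(2, 500.0), (4, 300.0), (8, 200.0), (16, 150.0)]


def fit_amdahl(runs):
    """Assumed parametric model t(n) = a + b / n, fitted by least squares on x = 1/n.

    Needs at least two distinct scale-outs in `runs`.
    """
    xs = [1.0 / n for n, _ in runs]
    ts = [t for _, t in runs]
    x_bar, t_bar = mean(xs), mean(ts)
    b = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = t_bar - b * x_bar
    return lambda n: a + b / n


def fit_interpolation(runs):
    """Assumed nonparametric model: piecewise-linear interpolation between the
    mean runtimes observed per scale-out, clamped outside the observed range."""
    by_n = {}
    for n, t in runs:
        by_n.setdefault(n, []).append(t)
    points = sorted((n, mean(ts)) for n, ts in by_n.items())

    def predict(n):
        if n <= points[0][0]:
            return points[0][1]
        if n >= points[-1][0]:
            return points[-1][1]
        for (n0, t0), (n1, t1) in zip(points, points[1:]):
            if n0 <= n <= n1:
                return t0 + (t1 - t0) * (n - n0) / (n1 - n0)

    return predict


def loo_relative_error(fit, runs):
    """Mean relative prediction error under leave-one-out cross-validation."""
    errors = []
    for i, (n, t) in enumerate(runs):
        model = fit(runs[:i] + runs[i + 1:])
        errors.append(abs(model(n) - t) / t)
    return mean(errors)


def select_model(runs, fitters):
    """Pick the fitter with the lowest cross-validated error, refit on all runs."""
    best = min(fitters, key=lambda fit: loo_relative_error(fit, runs))
    return best.__name__, best(runs)


def select_scale_out(predict, target, candidates):
    """Smallest candidate scale-out whose predicted runtime meets the target, else None."""
    for n in sorted(candidates):
        if predict(n) <= target:
            return n
    return None


name, model = select_model(RUNS, [fit_amdahl, fit_interpolation])
print(name, select_scale_out(model, target=250, candidates=range(1, 33)))  # → fit_amdahl 6
```

With this synthetic history the runtimes follow t(n) = 100 + 800/n exactly, so the Amdahl-style fit has near-zero cross-validated error and wins, and the smallest scale-out meeting a 250 s target is 6 nodes. On real, noisier histories either model may win, which is why choosing the model per job, as the abstract describes, can beat committing to a single one.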