KEA: Tuning an Exabyte-Scale Data Infrastructure
Author: | Kartheek Muthyala, Sudhir Darbha, Abhishek Modi, Minu Iyer, Subru Krishnan, Nick Jurgens, Ankita Agarwal, Conor Power, Konstantinos Karanasos, Deli Zhang, Manoj Kumar, Yiwen Zhu, Isha Tarte, Carlo Curino, Sarvesh Sakalanaga |
---|---|
Language: | English |
Year of publication: | 2021 |
Subject: |
Exabyte
FOS: Computer and information sciences; Computer science; business.industry; Scale (chemistry); CPU time; 020207 software engineering; Databases (cs.DB); 02 engineering and technology; Capacity management; Industrial engineering; Software; Computer Science - Databases; 020204 information systems; Limit (music); 0202 electrical engineering, electronic engineering, information engineering; Domain knowledge; Data center; business |
Description: | Microsoft's internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware/software/workloads, this manual tuning approach had reached its limit -- we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our clusters' dynamic behavior in a set of machine learning (ML) models based on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership in critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine "observational" tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of "flighting" (i.e., conservative testing in production). This allows us to support a broad range of applications that we discuss in this paper. KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. To the best of our knowledge, this paper is the first to discuss research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure. |
Database: | OpenAIRE |
External link: |