Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads

Authors:	Denis Sheahan, Janet Yang, Lei Tian, Valentin Andrei, Bilge Acun, Cyril Meurillon, Gisle Dankel, Peifeng Yu, Adnan Aziz, Christopher Gregg, Lukasz Wesolowski, Kim Hazelwood, Xiaoqiao Meng
Year of publication:	2021
Subject:	Profiling (computer programming); Computer science; Call stack; Scale (chemistry); Machine learning; Workflow; Stack (abstract data type); Hardware and Architecture; Software deployment; Component (UML); Server; Artificial intelligence; Electrical and Electronic Engineering; Software
Source:	IEEE Micro. 41:101-112
ISSN:	1937-4143 0272-1732
Description:	In this article, we present a system to collectively optimize efficiency in a very-large-scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3) provides periodic and on-demand whole-system profiling for workflows; and 4) automatically analyzes traces for common antipatterns. We present each component of the stack and show case studies demonstrating the use of the tools to significantly improve performance. To our knowledge, our system is the most complete and effective solution for identifying and addressing efficiency problems in datacenter-scale GPU deployments.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::28d38d31c236b057ae2b76014b945313 https://doi.org/10.1109/mm.2021.3097287