Characterization and prediction of deep learning workloads in large-scale GPU datacenters

Authors:	Tianwei Zhang, Shengen Yan, Peng Sun, Yonggang Wen, Qinghao Hu
Publication year:	2021
Subject:	FOS: Computer and information sciences; Computer Science - Machine Learning; Service (systems architecture); Computer science; business.industry; Deep learning; Distributed computing; Machine Learning (cs.LG); Scheduling (computing); Energy conservation; Computer Science - Distributed, Parallel, and Cluster Computing; Scale (social sciences); Cluster (physics); Resource management; Distributed, Parallel, and Cluster Computing (cs.DC); Artificial intelligence; Time series; business
Source:	SC
DOI:	10.1145/3458817.3476223
Description:	Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimizing resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We draw conclusions from the perspectives of clusters, jobs, and users that can inform cluster system design. Second, we introduce a general-purpose framework that manages resources based on historical data. As case studies, we design a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%. Comment: This paper has been accepted by the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21), Nov 14-19, 2021, St. Louis, USA
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::68392aa6cbb84f2c52356c7d81a691be https://doi.org/10.1145/3458817.3476223