Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training
Author: Qinyi Luo, Jiaao He, Youwei Zhuo, Xuehai Qian
Year of publication: 2020
Subject: Speedup, Computer science, Distributed computing, Serialization, Parallel algorithm, Networking & telecommunications, Deadlock, Asynchronous communication, Synchronization (computer science), Convergence (routing), Overhead (computing)
Source: ASPLOS
DOI: 10.1145/3373376.3378499
Description: Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data-parallel algorithms because of its high performance in homogeneous environments. However, its performance is bounded by the slowest worker, so it is significantly slower in heterogeneous settings. AD-PSGD, a recently proposed synchronization method that offers fast numerical convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: a distributed training method with both the high performance of All-Reduce in homogeneous environments and the heterogeneity tolerance of AD-PSGD? In this paper, we propose Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve this goal through intensive synchronization optimization, exploring the interplay between the algorithm and the system implementation, i.e., between statistical and hardware efficiency. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, which enables fast synchronization among a group of workers. To reduce serialization cost, we propose static group scheduling for homogeneous environments and two simple techniques, Group Buffer and Group Division, that largely eliminate conflicts at the cost of slightly reduced randomness. Our experiments show that in a homogeneous environment, Prague is 1.2x faster than the state-of-the-art implementation of All-Reduce, 5.3x faster than Parameter Server, and 3.7x faster than AD-PSGD. In a heterogeneous setting, Prague tolerates slowdowns well and achieves a 4.4x speedup over All-Reduce.
Database: OpenAIRE
External link:
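
The central primitive named in the abstract, Partial All-Reduce, synchronizes only a group of workers at a time while the remaining workers keep computing. The sketch below is a minimal, hypothetical illustration of that idea built on PyTorch's standard torch.distributed collectives; it is not the paper's optimized primitive, and the function names, group setup, and membership handling are assumptions made purely for illustration.

```python
# Minimal sketch (assumption, not the paper's implementation): a group of
# workers averages its model parameters with a standard collective, while
# workers outside the group continue training unblocked.
import torch.distributed as dist

def make_group(ranks):
    # dist.new_group() must be entered collectively by every rank in the job,
    # so groups are created up front here; Prague instead forms groups
    # dynamically (static group scheduling, Group Buffer, Group Division per
    # the abstract), which this sketch does not model.
    return dist.new_group(ranks=ranks)

def partial_all_reduce(model, ranks, group):
    """Average the parameters of the workers listed in `ranks` only."""
    if dist.get_rank() not in ranks:
        return  # non-members are not blocked and keep training
    size = float(len(ranks))
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
        p.data.div_(size)
```

In a real run, each worker would call partial_all_reduce after its local SGD step with the group it was assigned; how groups are formed and scheduled to avoid conflicts is exactly where the paper's contributions lie.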