A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

Authors:	Jianping Wu, Dan Li, Shuai Wang, Shu-Tao Xia, Songtao Wang, Yanshu Wang, Jinkun Geng, Yang Cheng
Publication year:	2020
Subjects:	Ethernet; Network architecture; Computer Networks and Communications; Computer science; Testbed; Local area network; Networking & telecommunications; Fault tolerance; Engineering and technology; Machine learning; Synchronization; Computer Science Applications; Server; Scalability; Electrical engineering, electronic engineering, information engineering; Network performance; Artificial intelligence; Electrical and Electronic Engineering; Fat tree; Software
Source:	IEEE/ACM Transactions on Networking. 28:1752-1764
ISSN:	1558-2566 1063-6692
Description:	In large-scale distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a scalable, high-performance and fault-tolerant DML network architecture on top of Ethernet and commodity devices. BML builds on the BCube topology and runs a fully distributed gradient synchronization algorithm. Compared to a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing both to its low theoretical synchronization time and to its benefit to RDMA transport. With server/link failures, the performance of BML degrades gracefully. Experiments with the MNIST and VGG-19 benchmarks on a testbed with 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with Fat-Tree running a state-of-the-art gradient synchronization algorithm.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::e579b9e81a746983be41ac91a4a91094 https://doi.org/10.1109/tnet.2020.2999377