TSEngine: Enable Efficient Communication Overlay in Distributed Machine Learning in WANs

Authors: Huaman Zhou, Cai Weibo, Zonghang Li, Hongfang Yu, Ling Liu, Long Luo, Gang Sun
Year of publication: 2021
Subject:
Source: IEEE Transactions on Network and Service Management. 18:4846-4859
ISSN: 2373-7379
DOI: 10.1109/tnsm.2021.3106315
Description: In recent years, distributed machine learning in WANs (DML-WANs), i.e., collaboratively training a high-quality ML model across geo-distributed micro-clouds or edge devices, has attracted attention and been widely applied. Compared with cloud-centric training, DML-WANs avoids both the high cost of transferring large amounts of raw data to a central cloud and the associated privacy concerns. However, performing DML-WANs still faces challenges. Model synchronization, an essential step of DML-WANs, requires heavy model traffic across bandwidth-limited WANs, which incurs high communication overhead. Moreover, the widely used parameter server system performs model synchronization in a centralized manner, resulting in a serious communication in-cast problem. This in-cast further raises the communication overhead and lowers the efficiency of DML-WANs. To alleviate the communication in-cast, existing research attempts to build tree-based communication overlays over the parameter server and workers. However, we identify that these approaches cannot adapt to the dynamic and heterogeneous network of DML-WANs, resulting in insufficient improvements. This paper proposes TSEngine, an adaptive communication scheduler for an efficient communication overlay of the parameter server system in DML-WANs. Its core idea is to dynamically schedule the communication logic over the parameter server and workers based on active network perception. Specifically, we propose novel communication scheduling protocols for model distribution and model aggregation, respectively. We have implemented TSEngine in a mainstream parameter server system and verified its effectiveness on DML-WANs testbeds.
Database: OpenAIRE