Efficient distributed machine learning for large-scale models by reducing redundant communication

Author: Harumichi Yokoyama, Takuya Araki
Publication Year: 2017
Subject:
Source: SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI
DOI: 10.1109/uic-atc.2017.8397638
Description: Distributed machine learning is used to train large-scale models within a moderate amount of time. To accelerate training, all nodes must exchange their calculation results frequently; however, communicating the updated parameters imposes a large overhead on the total execution time. This paper proposes a communication method that decreases redundant transmissions of parameters without changing the semantics of the algorithm. Before training, we identify the parts of the collective communication that can be omitted and replace them with direct communication between the nodes that request the intermediate results. We implemented this method and evaluated it on a cluster of five nodes connected by 10-Gbps Ethernet. An evaluation using a real dataset showed that our method reduces the number of elements exchanged between nodes and shortens the communication time.
Database: OpenAIRE
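
The abstract does not spell out the mechanism, but the general idea can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration, not the paper's implementation: it assumes each node holds one block of intermediate results and that a precomputed requests map (an assumption introduced here, not taken from the source) records which nodes actually need which blocks. It then compares the number of elements moved by a full allgather-style exchange against direct transfers of only the requested blocks.

    """Illustrative sketch (not the authors' implementation) of the idea in the
    abstract: omit the parts of a collective exchange that no node needs and
    send the remaining intermediate results directly to the nodes that request
    them. Node count matches the abstract's cluster; parameter size and the
    `requests` map are hypothetical values chosen for illustration only."""

    NUM_NODES = 5            # cluster size mentioned in the abstract
    PARAMS_PER_NODE = 1000   # hypothetical number of elements per node

    # Hypothetical access pattern determined before training:
    # requests[i] = set of nodes whose intermediate results node i needs.
    requests = {
        0: {1, 2},
        1: {0},
        2: {0, 3},
        3: {2, 4},
        4: {3},
    }


    def elements_allgather(num_nodes: int, params_per_node: int) -> int:
        """Baseline: every node receives every other node's full result."""
        return num_nodes * (num_nodes - 1) * params_per_node


    def elements_direct(reqs: dict, params_per_node: int) -> int:
        """Direct exchange: each node receives only the blocks it requested."""
        return sum(len(sources) for sources in reqs.values()) * params_per_node


    if __name__ == "__main__":
        baseline = elements_allgather(NUM_NODES, PARAMS_PER_NODE)
        direct = elements_direct(requests, PARAMS_PER_NODE)
        print(f"allgather elements:   {baseline}")
        print(f"direct-only elements: {direct}")
        print(f"reduction:            {1 - direct / baseline:.0%}")

Because the requests map is fixed before training, the decision of which transfers to omit costs nothing during the iterations themselves, which is consistent with the abstract's claim that the algorithm's semantics are unchanged while the communicated volume shrinks.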