Efficient distributed machine learning for large-scale models by reducing redundant communication
Authors: | Harumichi Yokoyama, Takuya Araki |
---|---|
Year of publication: | 2017 |
Subject: | Ethernet; Distributed database; Semantics (computer science); Computer science; Machine learning; Data modeling; Information systems; Server; Electrical engineering, electronic engineering, information engineering; Communication methods; Overhead (computing); Artificial intelligence; Scale model |
Source: | SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI |
DOI: | 10.1109/uic-atc.2017.8397638 |
Description: | Distributed machine learning is used to train large-scale models within a moderate amount of time. To accelerate training, all nodes have to exchange their calculation results frequently; however, communicating the updated parameters incurs a large overhead that affects the total execution time. This paper proposes a communication method that reduces redundant transmissions of parameters without changing the semantics of the algorithm. Before training, we identify the parts of the collective communication that can be omitted and replace them with direct communication between the nodes that request the intermediate results (see the sketch after this record). We implemented this algorithm and evaluated it on a cluster of five nodes connected with 10-Gbps Ethernet. The evaluation on a real dataset showed that our method reduced the number of elements exchanged between nodes and shortened the communication time. |
Database: | OpenAIRE |
External link: |
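
The description sketches the core mechanism: a one-time analysis before training determines which parts of the collective parameter exchange each node will actually need, and the redundant collective is then replaced by direct transfers of only the requested slices. Below is a minimal illustrative sketch of that idea, not the authors' implementation; the mpi4py setup, the parameter partitioning, and the toy `needed_from` request rule are assumptions introduced here.

```python
# Minimal sketch (assumed setup, not the paper's code): replace a full
# allgather of parameters with direct exchange of only pre-negotiated slices.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

DIM = 1000                              # hypothetical total model size
local = np.random.rand(DIM // size)     # this node's parameter partition

# Before training: each node announces which remote slices it will need.
# Toy rule (an assumption): only a small slice of the next node's partition.
needed_from = {(rank + 1) % size: slice(0, DIM // size // 10)}
requests = comm.allgather(needed_from)  # one-time setup, not repeated per step

for step in range(3):
    local += 0.01 * np.random.rand(local.size)   # stand-in for a local update

    # Redundant version would be: comm.allgather(local) -> everyone gets everything.
    # Reduced version: send only the slices that other nodes requested from us.
    sends = [comm.isend(local[sl], dest=r, tag=step)
             for r, req in enumerate(requests)
             for owner, sl in req.items()
             if owner == rank and r != rank]

    # Receive only what this node asked for during setup.
    recvd = {owner: comm.recv(source=owner, tag=step)
             for owner in needed_from if owner != rank}

    MPI.Request.Waitall(sends)
```

Compared with calling `comm.allgather(local)` at every step, only the slices negotiated before training travel over the network, which is the kind of redundancy reduction the abstract describes.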