HammingMesh: A Network Topology for Large-Scale Deep Learning.

Author: Hoefler, Torsten; Bonato, Tommaso; De Sensi, Daniele; Di Girolamo, Salvatore; Li, Shigang; Heddes, Marco; Goel, Deepak; Castro, Miguel; Scott, Steve
Source: Communications of the ACM; Dec2024, Vol. 67 Issue 12, p97-105, 9p
Abstract: This article presents HammingMesh, a flexible network topology that addresses current high-performance computing systems' inability to support deep-learning workloads by allowing the ratio of local to global bandwidth to be adjusted. The article discusses communication in distributed deep learning, covering data parallelism, pipeline parallelism, and operator parallelism. Topics include bisection and global bandwidth, logical job topologies and failures, and microbenchmarks on HammingMesh.
Database: Complementary Index