HammingMesh: A Network Topology for Large-Scale Deep Learning.

Author: Hoefler, Torsten; Bonato, Tommaso; De Sensi, Daniele; Di Girolamo, Salvatore; Li, Shigang; Heddes, Marco; Goel, Deepak; Castro, Miguel; Scott, Steve
Source: Communications of the ACM; Dec2024, Vol. 67 Issue 12, p97-105, 9p
Abstract: This article presents HammingMesh, a flexible network topology that addresses current high-performance computing systems' inability to support deep-learning workloads by allowing the ratio of local to global bandwidth to be adjusted. The article discusses communication in distributed deep learning, covering data parallelism, pipeline parallelism, and operator parallelism. Topics include bisection and global bandwidth, logical job topologies and failures, and microbenchmarks on HammingMesh.
Database: Complementary Index