Author:
Hoefler, Torsten; Bonato, Tommaso; De Sensi, Daniele; Di Girolamo, Salvatore; Li, Shigang; Heddes, Marco; Goel, Deepak; Castro, Miguel; Scott, Steve
Source:
Communications of the ACM; Dec 2024, Vol. 67 Issue 12, p97-105, 9p
Abstract:
This article presents HammingMesh, a flexible network topology that overcomes current high-performance computing topologies' poor fit for deep-learning workloads by allowing the ratio of local to global bandwidth to be adjusted. The article discusses communication in distributed deep learning, covering data parallelism, pipeline parallelism, and operator parallelism. Topics include bisection and global bandwidth, logical job topologies and failures, and microbenchmarks on HammingMesh.
Database:
Complementary Index