ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms
Authors: | Srinivas Sridharan, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna |
---|---|
Year of publication: | 2020 |
Subject: |
010302 applied physics; Computer science; business.industry; Deep learning; Cloud computing; 02 engineering and technology; ASTRA; Network topology; 01 natural sciences; 020202 computer hardware & architecture; Scheduling (computing); Network simulation; Collective communication; Computer architecture; End-to-end principle; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Artificial intelligence; business |
Source: | ISPASS |
Description: | Modern Deep Learning systems rely heavily on distributed training over high-performance accelerator-based (e.g., TPU, GPU) hardware platforms. Examples today include Google's Cloud TPU and Facebook's Zion. DNN training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the end-point accelerator. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex SW/HW design-space so that future systems can support efficient training of future DNN models. In this work, we make the following contributions: (i) establish the SW/HW design-space for Distributed Training over a hierarchical scale-up fabric, (ii) develop a network simulator for navigating the design-space, and (iii) demonstrate the promise of algorithm-topology co-design for speeding up end-to-end training. |
Database: | OpenAIRE |