3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
Autor: | Viveck R. Cadambe, Vipul Gupta, Tze Meng Low, Yaoqing Yang, Pulkit Grover, Kannan Ramchandran, Christian Engelmann, Haewon Jeong |
---|---|
Rok vydání: | 2020 |
Předmět: |
Computer science
Computation 0202 electrical engineering electronic engineering information engineering Redundancy (engineering) 020206 networking & telecommunications 010103 numerical & computational mathematics 02 engineering and technology Parallel computing 0101 mathematics Error detection and correction 01 natural sciences Matrix multiplication |
Zdroj: | Euro-Par 2020: Parallel Processing ISBN: 9783030576745 Euro-Par |
Popis: | In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires \(\sim \)50% less redundancy than replication, while the overhead in execution time is only about 5–10%. |
Databáze: | OpenAIRE |
Externí odkaz: |