3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

Authors: Viveck R. Cadambe, Vipul Gupta, Tze Meng Low, Yaoqing Yang, Pulkit Grover, Kannan Ramchandran, Christian Engelmann, Haewon Jeong
Year of publication: 2020
Subject:
Source: Euro-Par 2020: Parallel Processing, ISBN: 9783030576745
Description: In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires about 50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
Database: OpenAIRE
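
The description mentions MatDot codes without showing how they recover a product from a subset of workers. The following is a minimal single-process sketch of the basic MatDot idea only; it is not the paper's 3D Coded SUMMA and does not model the SUMMA process grid or any communication. It assumes A is split into m column blocks and B into m row blocks, so that AB is the sum of the block products A_i B_i, and that the inner dimension is divisible by m. All function names (`encode`, `decode`, `coded_multiply`, and so on) and the choice of evaluation points 1, 2, ..., P are illustrative assumptions, not taken from the record. Exact rational arithmetic is used to avoid the numerical issues that a floating-point implementation would have to handle.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def encode(A, B, m, points):
    """Return one coded pair (p_A(x), p_B(x)) per evaluation point x."""
    w = len(B) // m  # block width; inner dimension assumed divisible by m
    A_blocks = [[row[i * w:(i + 1) * w] for row in A] for i in range(m)]  # column blocks
    B_blocks = [B[i * w:(i + 1) * w] for i in range(m)]                   # row blocks
    shares = []
    for x in points:
        # p_A(x) = sum_i A_i x^i
        At = [[sum(A_blocks[i][r][c] * x ** i for i in range(m))
               for c in range(w)] for r in range(len(A))]
        # p_B(x) = sum_j B_j x^(m-1-j)
        Bt = [[sum(B_blocks[j][r][c] * x ** (m - 1 - j) for j in range(m))
               for c in range(len(B[0]))] for r in range(w)]
        shares.append((At, Bt))
    return shares


def interpolation_weights(points, k):
    """Weights w_t with: coefficient of x^k of the interpolant = sum_t w_t * y_t."""
    weights = []
    for t, xt in enumerate(points):
        poly = [Fraction(1)]  # Lagrange basis polynomial, lowest degree first
        for s, xs in enumerate(points):
            if s == t:
                continue
            denom = Fraction(xt - xs)
            nxt = [Fraction(0)] * (len(poly) + 1)
            for d, c in enumerate(poly):  # multiply by (x - xs) / denom
                nxt[d] -= c * xs / denom
                nxt[d + 1] += c / denom
            poly = nxt
        weights.append(poly[k])
    return weights


def decode(results, points, m):
    """Recover AB from any 2m-1 surviving worker results (dict: index -> matrix)."""
    need = 2 * m - 1  # p_A(x) p_B(x) has degree 2m-2
    if len(results) < need:
        raise ValueError("too many failures to decode")
    idx = sorted(results)[:need]
    wts = interpolation_weights([points[i] for i in idx], m - 1)
    first = results[idx[0]]
    # AB is the coefficient of x^(m-1) in p_A(x) p_B(x)
    return [[sum(wt * results[i][r][c] for wt, i in zip(wts, idx))
             for c in range(len(first[0]))] for r in range(len(first))]


def coded_multiply(A, B, m, num_workers, failed):
    """Simulate num_workers workers, drop those in `failed`, and decode AB."""
    points = list(range(1, num_workers + 1))  # hypothetical choice of distinct points
    shares = encode(A, B, m, points)
    results = {i: matmul(At, Bt) for i, (At, Bt) in enumerate(shares) if i not in failed}
    return decode(results, points, m)


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
# m = 2 needs 2m-1 = 3 results, so 5 workers can lose any 2.
print(coded_multiply(A, B, 2, 5, failed={1, 3}) == matmul(A, B))
```

In this sketch, each worker multiplies a pair of blocks that is 1/m the size of the inner dimension, and the decoder needs only 2m-1 of the P results, so any P-(2m-1) failed workers can be ignored. A practical implementation would likely choose evaluation points and a decoding method suited to floating-point arithmetic.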