Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Autor: Jieyang Chen, Mark Raugas, Jesun Sahariar Firoz, Ang Li, Shuaiwen Leon Song, Chenhao Xie, Kevin J. Barker, Jiajia Li
Rok vydání: 2021
Předmět:
Zdroj: ICPP
DOI: 10.1145/3472456.3472478
Popis: Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a challenging task due to significant irregular memory references and workload imbalance across GPUs. These challenges are particularly compounded in the case of Sparse Triangular Solver (SpTRSV), which introduces additional complexity of two-dimensional computation dependencies among subsequent computation steps. Dependency information may need to be exchanged and shared among GPUs, thus warranting for efficient memory allocation, data partitioning, and workload distribution as well as fine-grained communication and synchronization support. In this work, we focus on designing algorithm for SpTRSV in a single-node, multi-GPU setting. We demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite linking via fast interconnect like NVLinks and NVSwitches. Alternatively, we employ the latest NVSHMEM technology based on Partitioned Global Address Space programming model to enable efficient fine-grained communication and drastic synchronization overhead reduction. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve an average of 3.53 × (up to 9.86 ×) speedup on a DGX-1 system and 3.66 × (up to 9.64 ×) speedup on a DGX-2 system with four GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU systems.
Databáze: OpenAIRE