Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Authors:	Jieyang Chen, Mark Raugas, Jesun Sahariar Firoz, Ang Li, Shuaiwen Leon Song, Chenhao Xie, Kevin J. Barker, Jiajia Li
Publication year:	2021
Subject:	FOS: Computer and information sciences; Speedup; Computer science; Parallel computing; Solver; Supernode; Computer Science - Distributed Parallel and Cluster Computing; Hardware Architecture (cs.AR); Scalability; Synchronization (computer science); Overhead (computing); Distributed Parallel and Cluster Computing (cs.DC); Partitioned global address space; Computer Science - Hardware Architecture; Execution model
Source:	ICPP
DOI:	10.1145/3472456.3472478
Description:	Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a challenging task due to significant irregular memory references and workload imbalance across GPUs. These challenges are compounded in the case of the Sparse Triangular Solver (SpTRSV), which introduces the additional complexity of two-dimensional computation dependencies among subsequent computation steps. Dependency information may need to be exchanged and shared among GPUs, which calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we focus on designing an algorithm for SpTRSV in a single-node, multi-GPU setting. We demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked via fast interconnects such as NVLink and NVSwitch. As an alternative, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and to drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that further improves GPU utilization. With these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design achieves an average speedup of 3.53× (up to 9.86×) on a DGX-1 system and 3.66× (up to 9.64×) on a DGX-2 system with four GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU systems.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::cbb54f5c54934ee0e6bc9f7425cbd90f https://doi.org/10.1145/3472456.3472478