Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Autor: Wai Teng Tang, Rick Siow Mong Goh, Mian Lu, Huynh Phung Huynh, Ruizhe Zhao, Yun Liang
Rok vydání: 2017
Předmět:
Zdroj: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 36:2106-2119
ISSN: 1937-4151
0278-0070
DOI: 10.1109/tcad.2017.2681072
Popis: Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures including graphic processing units and Xeon Phi coprocessors. We first propose a hardware oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and improve the data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware oblivious implementation. Experiments show that it achieves comparable performance to the Intel MKL and state-of-the-art OpenCL-based ViennaCL library implementation. Although the OpenCL implementation provides functional portability for heterogeneous systems, it fails to take advantage of the low-level architectural features. To further improve the performance, we propose a hardware conscious implementation using the native parallel programming language. We use the Xeon Phi platform as a case study. In our hardware conscious implementation, we ensure that the HCC format efficiently utilizes the vector process units on Xeon Phi by employing low-level intrinsics, and improve the overall performance through locality-aware block mapping, and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that compared with the OpenCL-based hardware oblivious implementation, the hardware conscious implementation achieves $2.2\boldsymbol {\times }$ speedup on average. Compared with MKL, the hardware conscious implementation achieves $3.1\boldsymbol {\times }$ speedup on Xeon Phi.
Databáze: OpenAIRE