Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures
Author: | Wai Teng Tang, Rick Siow Mong Goh, Mian Lu, Huynh Phung Huynh, Ruizhe Zhao, Yun Liang |
---|---|
Year of publication: | 2017 |
Subject: |
010302 applied physics; Coprocessor; Speedup; Computer science; Performance tuning; Sparse matrix-vector multiplication; 02 engineering and technology; Parallel computing; Intrinsics; 01 natural sciences; Computer Graphics and Computer-Aided Design; 020202 computer hardware & architecture; Instruction set; 0103 physical sciences; Parallel programming model; 0202 electrical engineering, electronic engineering, information engineering; Multiplication; Electrical and Electronic Engineering; Software; Xeon Phi |
Source: | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 36:2106-2119 |
ISSN: | 1937-4151 0278-0070 |
DOI: | 10.1109/tcad.2017.2681072 |
Description: | Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. Experiments show that it achieves performance comparable to the Intel MKL and the state-of-the-art OpenCL-based ViennaCL library implementation. Although the OpenCL implementation provides functional portability for heterogeneous systems, it fails to take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language. We use the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve the overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a $2.2\times$ speedup on average. Compared with MKL, the hardware-conscious implementation achieves a $3.1\times$ speedup on Xeon Phi. |
Database: | OpenAIRE |
External link: |
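
For readers unfamiliar with the two storage formats that the abstract's HCC format combines, the following is a minimal sketch of plain SpMV over CSR and COO in pure Python. It is not the authors' HCC format or their 2-D jagged partitioning, which the abstract does not specify in detail; the function names `spmv_csr` and `spmv_coo` and the 3x3 example matrix are hypothetical and chosen only for illustration.

```python
# Illustrative only: baseline SpMV for the two formats named in "hybrid COO+CSR".
# This is NOT the paper's HCC kernel; names and data here are assumptions.

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR: row i owns entries row_ptr[i]..row_ptr[i+1]-1."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Work per row equals the row's nonzero count, so very long rows
        # dominate when rows are distributed across parallel workers.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x with A in COO: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    # Work is per nonzero, so it is independent of how nonzeros are
    # distributed over rows (a parallel version would need atomic adds
    # or a segmented reduction on y).
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


# Example matrix (hypothetical):
#   [[1, 0, 2],
#    [0, 3, 0],
#    [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
coo_rows = [0, 0, 1, 2, 2, 2]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_coo = spmv_coo(3, coo_rows, col_idx, vals, x)
print(y_csr, y_coo)
```

Both functions compute the same product. The sketch only hints at why a hybrid might help on scale-free matrices, where row lengths are typically highly skewed: row-wise CSR work is uneven across rows, while COO work is uniform per nonzero.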