Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
Author: | Michael Bussmann, Benjamin Worpitz, Axel Huebl, René Widera, Erik Zenker, Alexander Matthes |
---|---|
Language: | English |
Year of publication: | 2017 |
Subject: |
FOS: Computer and information sciences
FOS: Computer and information sciences, Source code, Floating point, Computer science, POWER8, CUDA, Parallel computing, Basic Linear Algebra Subprograms, Hardware abstraction, C++, OpenMP, Platform portability, Performance portability, Computer Science - Distributed, Parallel, and Cluster Computing (cs.DC), Parameter tuning, HPC, Compiler, Heterogeneous computing |
Source: | 2nd International Workshop on Performance Portable Programming Models for Accelerators (P^3MA), 22.06.2017, Frankfurt am Main, Germany; ISC High Performance 2017: High Performance Computing, Lecture Notes in Computer Science, Vol 10524, pp. 496-514, ISBN 9783319676296 |
ISSN: | 0302-9743 1611-3349 |
Description: | We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM implementations, but merely choose this example to demonstrate that Alpaka allows platform-specific tuning with a single source code. In addition, we analyze the optimization potential of vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, as well as IBM's POWER8 system. On some of these we are able to reach almost 50% of the peak floating point performance using the aforementioned means. When adding compiler-specific #pragmas we reach 5 TFLOPS on a P100 and over 1 TFLOPS on a KNL system. Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt. |
Database: | OpenAIRE |
External link: |