Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations

Authors:	Rengan Xu, Barbara Chapman, Xiaonan Tian, Deepak Eachempati, Dounia Khaldi
Year of publication:	2016
Subject:	020203 distributed computing; Speedup; CPU cache; Computer science; Register file; Optimizing compiler; 02 engineering and technology; Parallel computing; computer.software_genre; Instruction set; CUDA; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Compiler; computer; Register allocation
Source:	ICPP
DOI:	10.1109/icpp.2016.72
Description:	Programming accelerator-based systems with compiler directives, through APIs such as OpenACC or OpenMP, has become increasingly popular because of the portability and productivity it offers. However, compared with the performance that lower-level programming interfaces such as CUDA or OpenCL provide, directive-based approaches may incur a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the most likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for improving the utilization of register files on accelerator devices and thereby substantially reducing the cost of memory operations. However, aggressive scalar replacement may require a large number of registers, which limits its applicability unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize register usage within computations offloaded using OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. Moreover, we extend the OpenACC interface with new clauses, namely dim and small, that reduce the number of scalars to replace. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and memory access latency. It also uses a static feedback strategy to retrieve low-level register information in order to guide the compiler in carrying out the scalar replacement transformation. The new clauses we propose then greatly reduce the number of scalars, eliminating the need for more registers.
We evaluate SAFARA and the new clauses using SPEC and NAS OpenACC benchmarks; our results suggest that these approaches are effective for improving the overall performance of code executing on GPUs. We obtained speedups of up to 2.5x on the NAS benchmarks and up to 2.08x on the SPEC benchmarks.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::9b23c742e90539308502dfe7074be204 https://doi.org/10.1109/icpp.2016.72