Non-Invasive, Memory Access-Triggered Near-Data Processing for DNN Training Acceleration on GPUs

Author: Hyungkyu Ham, Hyunuk Cho, Minjae Kim, Jueon Park, Jeongmin Hong, Hyojin Sung, Eunhyeok Park, Euicheol Lim, Gwangsun Kim
Language: English
Year of publication: 2024
Subject:
Source: IEEE Access, Vol 12, Pp 142651-142667 (2024)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3465789
Description: Currently, GPUs face significant challenges in DNN training due to limited off-chip bandwidth (BW) and memory capacity. To address these bottlenecks, we propose a memory access-triggered near-data processing (matNDP) architecture that offloads memory- and communication-bound operations. With matNDP, normal memory accesses also serve as implicit NDP requests, enabling NDP non-invasively, without modifying the core-side ISA, microarchitecture, or software, for practicality. Additionally, matNDP enables on-the-fly NDP, in which data already supplied by normal memory requests for compute-bound operations is reused for NDP; thus, matNDP can overlap even dependent kernels while also reducing memory traffic. Moreover, with this overlap, memory BW underutilized by the GPU cores can be used by the NDP units to improve performance under the same total memory BW. The matNDP units can be deployed in heterogeneous memory devices in a system. First, we deploy them near the GPU's memory controllers. Second, we deploy them in memory expanders connected to multiple GPUs to create an NDP-enabled memory eXpander Network (NDPXNet). NDPXNet can entirely offload gradient reduction and the optimizer in data-parallel training, achieving additional speedups while eliminating redundant memory usage and optimizer execution. Thus, we 1) enable NDP without core HW/SW changes, 2) overlap the execution of dependent layers, and 3) offload both memory- and communication-bound operations from GPUs during DNN training. Through our deep learning compiler support, NDP kernels are generated automatically without any model code modification. Consequently, matNDP improves training throughput by up to 2.73× and reduces energy by up to 41.4%.
Database: Directory of Open Access Journals
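
To make the offloaded step concrete, here is a minimal sketch of the gradient reduction plus optimizer update that, per the description above, NDPXNet performs near the data instead of on each GPU. It is a host-side NumPy emulation assuming plain SGD semantics; the function name ndp_reduce_and_update, the learning rate, and the array shapes are all hypothetical illustrations, not details from the paper.

    import numpy as np

    def ndp_reduce_and_update(weights, per_gpu_grads, lr=0.01):
        """Emulate the fused NDP step: reduce gradients across the
        data-parallel GPUs, then apply one SGD update near the data."""
        # Reduction that NDPXNet offloads: average the per-GPU gradients.
        reduced = per_gpu_grads.mean(axis=0)
        # Optimizer step that NDPXNet offloads: executed once near memory
        # instead of redundantly on every GPU.
        weights -= lr * reduced
        return weights

    # Four data-parallel GPUs producing gradients for a 1024-element tensor.
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(1024).astype(np.float32)
    per_gpu_grads = rng.standard_normal((4, 1024)).astype(np.float32)
    weights = ndp_reduce_and_update(weights, per_gpu_grads)

Running this step once near memory, rather than once per GPU, is what removes the redundant memory usage and optimizer execution that the description attributes to NDPXNet.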