Silent-PIM: Realizing the Processing-in-Memory Computing With Standard Memory Requests

Author: Yoonah Paik, Seok Young Kim, Il Park, Chang Hyun Kim, Kiyong Kwon, Seon Wook Kim, Wonjun Lee
Year: 2022
Source: IEEE Transactions on Parallel and Distributed Systems, 33:251-262
ISSN: 2161-9883, 1045-9219
DOI: 10.1109/tpds.2021.3065365
Description: Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly gaining market traction, process large amounts of low-locality data, so memory bandwidth limits their peak performance. Many data centers therefore adopt high-bandwidth memory such as HBM2/HBM2E to mitigate the problem. However, this approach is not a complete solution, since data must still be transferred from memory to the computing unit. Processing-in-memory (PIM), which performs the computation inside the memory device, has thus attracted attention. Most previous PIM proposals, however, require modifying or extending core pipelines and memory-system components such as memory controllers, which makes practical PIM implementation very challenging and expensive to develop. In this article, we propose Silent-PIM, which performs PIM computation with standard DRAM memory requests; it therefore requires no hardware modifications and allows the PIM memory device to compute while servicing the memory requests of non-PIM applications. We achieve this design goal by preserving standard memory-request behavior and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM offloading engine, so PIM memory requests are processed quickly and the core is free to perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) model kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For $(p \times 512) \times (512 \times 2048)$ matrix multiplication with the batch size $p$ varying from 1 to 128, Silent-PIM ran up to 16.9x and 24.6x faster than the GPU and the CPU, respectively, at $p = 1$, the case with no data reuse. At $p = 128$, the case with the highest data reuse, the GPU was fastest, but the PIM was still faster than the CPU. Similarly, for $(p \times 2048)$ element-wise multiplication and addition, where there is no data reuse, Silent-PIM always outperformed both the CPU and the GPU. Its energy-delay product (EDP) was also superior to the others in all cases with no data reuse.
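
The two evaluated kernel shapes can be illustrated with a minimal NumPy sketch. This is illustrative only, not code from the paper; the function names are made up here, and reading 2048 as four LSTM gates of hidden size 512 is an assumption.

import numpy as np

def lstm_projection(x, W):
    # Batched matrix multiplication: (p x 512) @ (512 x 2048).
    # At p = 1 the weight matrix is streamed once with no reuse, the
    # memory-bandwidth-bound case where the abstract reports PIM wins.
    return x @ W

def elementwise_gate_update(a, b, c):
    # Element-wise multiply-add on (p x 2048) operands: no data reuse,
    # so CPU/GPU performance is bounded by memory bandwidth.
    return a * b + c

p = 1                                    # batch size, varied from 1 to 128
x = np.random.rand(p, 512).astype(np.float32)
W = np.random.rand(512, 2048).astype(np.float32)
y = lstm_projection(x, W)                # shape (p, 2048)

a, b, c = (np.random.rand(p, 2048).astype(np.float32) for _ in range(3))
z = elementwise_gate_update(a, b, c)     # shape (p, 2048)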
Database: OpenAIRE