Cooperative Caching for GPUs

Author:	Vijay Nagarajan, Nigel Topham, Saumay Dublish
Language:	English
Year of publication:	2016
Subject:	010302 applied physics; Hardware_MEMORYSTRUCTURES; business.industry; CPU cache; Computer science; Thrashing; Ring network; 02 engineering and technology; 01 natural sciences; 020202 computer hardware & architecture; High memory; Hardware and Architecture; Embedded system; Multithreading; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Cache; Performance improvement; business; Critical path method; Software; Information Systems; Computer network
Source:	Dublish, S, Nagarajan, V & Topham, N 2016, 'Cooperative Caching for GPUs', ACM Transactions on Architecture and Code Optimization, vol. 13, no. 4, 39, pp. 1-25. https://doi.org/10.1145/3001589
DOI:	10.1145/3001589
Description:	The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies are exposed because too few active compute threads remain to mask them. In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3b1a6e0dec93e95f86a2ec795bebfa19 https://www.pure.ed.ac.uk/ws/files/29959329/taco16_dublish_PURE_1.pdf