Contention-Aware Selective Caching to Mitigate Intra-Warp Contention on GPUs

Authors: David Troendle, Choo Kyo-Shin, Esraa A. Gad, Byunghyun Jang
Year of publication: 2017
Subject:
Source: ISPDC
Description: Modern GPUs embrace on-chip cache memory to exploit the locality present in applications. However, the behavior and effect of the cache on GPUs differ from those on conventional processors due to the Single Instruction Multiple Thread (SIMT) execution model and the resulting memory access patterns. Previous studies report that caching data can hurt performance due to increased memory traffic and thrashing on massively parallel GPUs. We found that the massively parallel thread execution of GPUs causes significant resource access contention among threads, especially within a warp, because memory resources are insufficient to support massively parallel thread execution when memory access patterns are not hardware friendly. In this paper, we propose locality- and contention-aware selective caching based on memory access divergence to mitigate intra-warp resource contention in the L1 data (L1D) cache on GPUs. To determine when and what to cache, we use the following heuristics: first, we detect the memory divergence degree of a memory instruction (i.e., how the memory requests from a warp are grouped) to determine whether selective caching is needed. Second, we use cache index calculation to handle congested cache sets. Finally, we calculate a locality degree to find a better victim cache line. This selective caching algorithm is based on our observations that 1) divergent memory accesses incur severe contention for cache hardware resources, and 2) accesses are mapped to certain sets when the set associativity is relatively small compared with the memory divergence degree. Experimental results from a GPU architectural simulator show that our proposed selective caching improves average performance by 2.25x over the baseline and reduces L1D cache accesses by 71%. It outperforms recently published state-of-the-art GPU cache bypassing schemes.
Database: OpenAIRE
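
The first heuristic in the description can be pictured as a small simulator-side routine: count how many distinct cache lines a single memory instruction from one warp touches (the memory divergence degree) and bypass L1D when that count is large relative to the set associativity. The sketch below is only an illustration of that idea under stated assumptions, not the authors' implementation; the 128-byte line size, the associativity-based threshold, and all identifiers are assumptions.

```cpp
// Minimal sketch of divergence-based selective caching (illustrative only).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct MemRequest { uint64_t addr; };  // one thread's memory request

// Memory divergence degree: number of distinct cache-line-sized transactions
// generated by one warp for a single memory instruction.
int divergence_degree(const std::vector<MemRequest>& warp_requests,
                      uint64_t line_size = 128 /* assumed L1D line size */) {
    std::vector<uint64_t> lines;
    for (const auto& r : warp_requests)
        lines.push_back(r.addr / line_size);
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
    return static_cast<int>(lines.size());
}

// Heuristic: cache in L1D only when the instruction is well coalesced relative
// to the L1D set associativity; otherwise bypass to avoid intra-warp contention.
// Tying the threshold to associativity is an assumption for illustration,
// not the paper's exact rule.
bool should_cache(int divergence, int l1d_associativity) {
    return divergence <= l1d_associativity;
}

int main() {
    // Example: 32 threads of a warp touching 32 distinct cache lines
    // (fully divergent access), with a 4-way associative L1D.
    std::vector<MemRequest> warp;
    for (uint64_t t = 0; t < 32; ++t)
        warp.push_back({t * 128});
    int deg = divergence_degree(warp);
    std::printf("divergence degree = %d, cache in L1D? %s\n",
                deg, should_cache(deg, /*l1d_associativity=*/4) ? "yes" : "no");
    return 0;
}
```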