D2MA

Authors:	D. Anoushe Jamshidi, Mehrzad Samadi, Scott Mahlke
Year of publication:	2014
Subject:	Distributed shared memory; Hardware_MEMORYSTRUCTURES; Physical address; Memory management; Flat memory model; Shared memory; Computer science; CUDA; Pinned memory; Interleaved memory; Parallel computing; Memory map
Source:	PACT
DOI:	10.1145/2628071.2628072
Description:	To achieve high performance on many-core architectures like GPUs, it is crucial to utilize the available memory bandwidth efficiently. It is currently common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs' shader cores, to buffer data for computation. This buffering, however, has several sources of inefficiency that keep it from making the best use of the available memory resources: shader resources are spent on repeated, regular address calculations; data must be shuffled multiple times within a physically unified on-chip memory; and all threads are forced to synchronize to ensure read-after-write (RAW) consistency, at the pace of the slowest threads. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into the shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advancements allow D2MA to achieve speedups as high as 2.29x and to reduce the time to buffer data by 81% on average.
Database:	OpenAIRE
External links:	https://explore.openaire.eu/search/publication?articleId=doi_________::045a80629d537f596b745f1828a1fb7d https://doi.org/10.1145/2628071.2628072