Detecting and mitigating data-dependent DRAM failures by exploiting current memory content

Authors:	Donghyuk Lee, Onur Mutlu, Christopher B. Wilkerson, Samira Khan, Zhe Wang, Alaa R. Alameldeen
Year of publication:	2017
Subject:	010302 applied physics; Dynamic random-access memory; Hardware_MEMORYSTRUCTURES; business.industry; Computer science; Fault tolerance; 02 engineering and technology; 01 natural sciences; 020202 computer hardware & architecture; Refresh rate; law.invention; Idle; law; Embedded system; 0103 physical sciences; 0202 electrical engineering electronic engineering information engineering; Latency (engineering); business; Dram
Source:	MICRO
DOI:	10.1145/3123939.3123945
Description:	DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error-correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge. In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates only those failures that occur with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. Because detecting failures through runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval.
MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips. CCS Concepts: Computer systems organization → Processors and memory architectures; Hardware → Dynamic memory
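Note:	The Pareto property stated in the description can be illustrated with a short sketch. For a Pareto (Type I) distribution with shape alpha > 1, the expected remaining idle time after a page has already been idle for t (with t at or above the distribution's scale) is t / (alpha - 1), so it grows with t. The function names, the alpha values, and the min_benefit threshold below are hypothetical and chosen only for illustration; the record does not describe MEMCON's actual predictor, and this is not it.

```python
def expected_remaining_idle(elapsed: float, alpha: float) -> float:
    """Mean residual idle time for a Pareto (Type I) write interval.

    Assumes shape alpha > 1 (finite mean) and elapsed >= the scale
    parameter of the distribution. Both are illustrative assumptions.
    """
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for a finite mean")
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return elapsed / (alpha - 1)


def should_test(elapsed: float, alpha: float, min_benefit: float) -> bool:
    """Hypothetical policy: test a page only if the expected remaining
    idle time is at least min_benefit (same time unit as elapsed)."""
    return expected_remaining_idle(elapsed, alpha) >= min_benefit


if __name__ == "__main__":
    # The longer the page has been idle, the longer it is expected to stay idle.
    for t in (1.0, 4.0, 16.0):
        print(t, expected_remaining_idle(t, alpha=3.0))
```

Under these assumptions, a page that has been idle longer is a better candidate for a runtime test, since the cost of the test is more likely to be recovered by the lower refresh rate over the rest of the interval.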
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::59bc8374e31191fd1b730fa375ae06ea https://doi.org/10.1145/3123939.3123945