Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes
Author: | Patel, Minesh |
---|---|
Contributors: | Mutlu, Onur; Erez, Mattan; Qureshi, Moinuddin; Sridharan, Vilas; Weis, Christian |
Language: | English |
Publication year: | 2021 |
Subject: |
Electric engineering; Hardware_MEMORYSTRUCTURES; ddc:621.3; Error Profiling; Memory Scaling; Fault Tolerance; Memory Reliability; Error Correction; Memory Systems; DRAM; Error Characterization; Data processing, computer science; Memory Errors; Computer Engineering; On-Die ECC; ECC; Memory Repair; System Reliability; Simulation; ddc:004 |
DOI: | 10.3929/ethz-b-000542542 |
Description: | Improvements in main memory storage density are primarily driven by process technology shrinkage (i.e., technology scaling), which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To compensate for growing error rates, both memory manufacturers and consumers develop and incorporate error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet reliability targets. Developing effective error mitigation techniques requires understanding the errors' characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner. In this dissertation, we experimentally study memory errors, examine how on-die ECC obfuscates their statistical characteristics, and develop new testing techniques to overcome the obfuscation through four key steps. First, we experimentally study DRAM data-retention error characteristics to understand the challenges inherent in understanding and mitigating memory errors that are related to technology scaling. Second, we study how on-die ECC affects these characteristics to develop Error Inference (EIN), a new statistical inference methodology for inferring key details of the on-die ECC mechanism and the raw errors that it obfuscates. Third, we examine the on-die ECC mechanism in detail to understand exactly how on-die ECC obfuscates raw bit error patterns. Using this knowledge, we introduce Bit Exact ECC Recovery (BEER), a new testing methodology that exploits uncorrectable error patterns to (1) reverse-engineer the exact on-die ECC implementation used in a given memory chip and (2) identify the bit-exact locations of the raw bit errors responsible for a set of errors that are observed after on-die ECC correction.
Fourth, we study how on-die ECC impacts error profiling and show that on-die ECC introduces three key challenges that negatively impact profiling practicality and effectiveness. To overcome these challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling strategy that uses simple modifications to the on-die ECC mechanism to quickly and effectively identify bits at risk of error. Finally, we conclude by discussing the critical need for transparency in DRAM reliability characteristics in order to enable DRAM consumers to better understand and adapt commodity DRAM chips to their system-specific needs. This dissertation builds a detailed understanding of how on-die ECC obfuscates the statistical properties of main memory error mechanisms using a combination of real-chip experiments and statistical analyses. Our results show that the error characteristics that on-die ECC obfuscates can be recovered using new memory testing techniques that exploit the interaction between on-die ECC and the statistical characteristics of memory error mechanisms to expose physical cell behavior. We hope and believe that the analysis, techniques, and results we present in this dissertation will enable the community to better understand and tackle current and future reliability challenges as well as adapt commodity memory to new advantageous applications. |
Database: | OpenAIRE |