Predicting DRAM reliability in the field with machine learning
Author: | Dorothea Wiesmann, Jácint Szabó, John J. Bird, Ioana Giurgiu |
---|---|
Year of publication: | 2017 |
Subject: |
0301 basic medicine; Hardware_MEMORYSTRUCTURES; Event (computing); Computer science; Reliability (computer networking); 02 engineering and technology; Missing data; Ensemble learning; Field (computer science); Reliability engineering; 03 medical and health sciences; 030104 developmental biology; 020204 information systems; Server; 0202 electrical engineering, electronic engineering, information engineering; False positive paradox; DRAM |
Source: | Middleware Industry |
DOI: | 10.1145/3154448.3154451 |
Description: | Uncorrectable errors in dynamic random access memory (DRAM) are a common form of hardware failure in server clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on analyzing DRAM reliability in large production clusters, little has been reported on the automatic prediction of such errors ahead of time. In this paper, we present a highly accurate predictive model, based on daily event logs and sensor measurements, in a large fleet of commodity servers going back to 2014. By correlating correctable errors with sensor metrics, we can use ensemble machine learning techniques to predict uncorrectable errors weeks in advance. In addition, we show how such models can be applied in the wild and consumed by customer support teams. Our goal is to minimize false positives, as healthy DRAMs should not be replaced, while accounting for common limitations, such as missing data points and rare occurrences of uncorrectable errors. |
Database: | OpenAIRE |
External link: |