Title
Predicting DRAM reliability in the field with machine learning.
Abstract
Uncorrectable errors in dynamic random access memory (DRAM) are a common form of hardware failure in server clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on analyzing DRAM reliability in large production clusters, little has been reported on the automatic prediction of such errors ahead of time. In this paper, we present a highly accurate predictive model, based on daily event logs and sensor measurements, in a large fleet of commodity servers going back to 2014. By correlating correctable errors with sensor metrics, we can use ensemble machine learning techniques to predict uncorrectable errors weeks in advance. In addition, we show how such models can be applied in the wild and consumed by customer support teams. Our goal is to minimize false positives, as healthy DRAMs should not be replaced, while accounting for common limitations, such as missing data points and rare occurrences of uncorrectable errors.
Year: 2017
DOI: 10.1145/3154448.3154451
Venue: Middleware '17: 18th International Middleware Conference, Las Vegas, Nevada, December 2017
Keywords: Memory systems, Reliability, Failure prediction, Ensemble machine learning
DocType: Conference
ISBN: 978-1-4503-5200-0
Citations: 5
PageRank: 0.44
References: 0
Authors: 4
Name, Order, Citations, PageRank
Ioana Giurgiu, 1, 213, 14.09
Jacint Szabo, 2, 5, 0.44
Dorothea Wiesmann, 3, 41, 4.30
John Bird, 4, 5, 0.78