Title
Physics-Informed Machine Learning for DRAM Error Modeling
Abstract
As high-performance computing facilities approach the exascale era, a detailed understanding of hardware failures becomes increasingly important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors that were statistically negligible at smaller scales will become more prevalent. To understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Because of the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors and show that relatively simple statistical models can effectively predict future errors from historical data, allowing proactive error mitigation.
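The idea of using DRAM spatial structure to estimate error risk at a location with no recorded errors can be illustrated with a minimal sketch. This is not the authors' model: the `Location` fields, the feature names, and the weights below are all hypothetical, chosen only to show how counts of past errors in the same DIMM, bank, row, or column might feed a simple logistic score.

```python
import math
from typing import Dict, Iterable, NamedTuple


class Location(NamedTuple):
    """Hypothetical physical address of a DRAM cell (illustrative field names)."""
    node: int
    dimm: int
    bank: int
    row: int
    column: int


def spatial_features(history: Iterable[Location], target: Location) -> Dict[str, int]:
    """Count past errors that share physical structure with the target location."""
    counts = {"same_dimm": 0, "same_bank": 0, "same_row": 0, "same_column": 0}
    for loc in history:
        # Errors on a different node or DIMM share no structure with the target.
        if (loc.node, loc.dimm) != (target.node, target.dimm):
            continue
        counts["same_dimm"] += 1
        # Rows and columns are only comparable within the same bank.
        if loc.bank != target.bank:
            continue
        counts["same_bank"] += 1
        if loc.row == target.row:
            counts["same_row"] += 1
        if loc.column == target.column:
            counts["same_column"] += 1
    return counts


# Assumed, hand-picked values for illustration only; a real model would
# learn these from field data.
ILLUSTRATIVE_WEIGHTS = {"same_dimm": 0.1, "same_bank": 0.5, "same_row": 1.0, "same_column": 1.0}
ILLUSTRATIVE_BIAS = -4.0


def risk_score(features: Dict[str, int]) -> float:
    """Map spatial feature counts to a score in (0, 1) with a logistic function."""
    z = ILLUSTRATIVE_BIAS + sum(
        ILLUSTRATIVE_WEIGHTS[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Toy error history: three errors on node 0, one on node 1.
history = [
    Location(0, 0, 1, 10, 5),
    Location(0, 0, 1, 10, 7),
    Location(0, 0, 2, 3, 5),
    Location(1, 0, 1, 10, 5),
]
near = Location(0, 0, 1, 10, 9)  # shares a row with two past errors
far = Location(2, 0, 0, 0, 0)    # shares nothing with the history

near_score = risk_score(spatial_features(history, near))
far_score = risk_score(spatial_features(history, far))
```

In this toy setup, a location that shares a row with earlier errors scores higher than one on an unrelated node, which is the kind of spatial-locality signal the abstract describes. Because the features are counts relative to a target location rather than machine-specific identifiers, the same feature function could in principle be applied to error logs from a different system.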
Year
2018
DOI
10.1109/DFT.2018.8602983
Venue
2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Keywords
proactive error mitigation, historical data, relatively simple statistical models, real-time error prediction, DRAM hardware, statistical methods, DRAM spatial structure, statistical machine, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, DRAM errors, exascale workloads, hardware faults, data corruption errors, modern supercomputers, hardware failures, high performance computing facilities, DRAM error modeling, physics-informed machine learning
Field
DRAM, Decision tree, Locality, Locality of reference, Supercomputer, Computer science, Cielo, Artificial intelligence, Data corruption, Statistical model, Machine learning
DocType
Conference
ISSN
1550-5774
ISBN
978-1-5386-8399-6
Citations
0
PageRank
0.34
References
18
Authors
8
Name                 Order   Citations   PageRank
Elisabeth Baseman    1       10          1.54
Nathan DeBardeleben  2       490         31.71
Sean Blanchard       3       190         13.20
Juston Moore         4       3           0.77
Olena Tkachenko      5       1           0.69
Kurt Ferreira        6       639         40.78
Taniya Siddiqua      7       119         6.73
Vilas Sridharan      8       512         23.45