Title
Analyzing Reliability of Memory Sub-systems with Double-Chipkill Detect/Correct
Abstract
Chipkill correct is an advanced type of error correction used in memory sub-systems. Existing analytical approaches for modeling the reliability of memory sub-systems with chipkill correct are limited to chipkill-correct solutions that guarantee correction of errors in a single DRAM device. However, stronger chipkill-correct solutions that guarantee the detection, and even the correction, of errors in up to two DRAM devices have become common in existing HPC systems. Analytical reliability models are needed for such memory sub-systems. This paper proposes analytical models for the reliability of double-chipkill detect and/or correct. Validation against Monte Carlo simulations shows that the outputs of our analytical models are within 3.9% of the Monte Carlo results, on average. We used the analytical models to study various aspects of the reliability of memory sub-systems protected by double-chipkill detect and/or correct. Our studies provide several insights into how the reliability of these systems depends on scale, device fault rate, memory organization, and memory-scrubbing policy.
Year
2013
DOI
10.1109/PRDC.2013.18
Venue
PRDC
Keywords
memory subsystems, monte carlo simulation, memory sub-system reliability analysis, memory organization, memory-scrubbing policy, integrated circuit reliability, device fault rate, analytical reliability model, monte carlo simulations, error detection codes, correct solution, memory errors, analytical approach, double chip kill correct solutions, chip kill-correct solution, error correction codes, single dram device, dram chips, double-chipkill detect, analytical model, monte carlo methods, systems on scale, double-chip kill detect reliability, reliability, analyzing reliability, error correcting codes, memory sub-systems, modeling, analytical reliability models, chipkill correct, error correction, hpc systems
Field
DRAM, Monte Carlo method, Computer science, Fault rate, Error detection and correction, Real-time computing, Chip, Memory organisation, Memory errors, Reliability model
DocType
Conference
Citations
4
PageRank
0.46
References
6
Authors
5
Name                  Order  Citations  PageRank
Xun Jian              1      66         6.08
Nathan DeBardeleben   2      490        31.71
Sean Blanchard        3      190        13.20
Vilas Sridharan       4      512        23.45
Rakesh Kumar          5      1923       157.44