Title
Memory Errors in Modern Systems: The Good, The Bad, and The Ugly
Abstract
Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
Year
2015
DOI
10.1145/2694344.2694348
Venue
ASPLOS
Keywords
field studies; large-scale systems; reliability; reliability, testing, and fault-tolerance
Field
Psychological resilience; DRAM; Computer science; Parallel computing; Real-time computing; Static random-access memory; Data center; Memory errors; Computing systems
DocType
Conference
Volume
43
Issue
1
ISSN
0163-5964
Citations
75
PageRank
1.67
References
19
Authors (7)
Name                  Order  Citations  PageRank
Vilas Sridharan       1      512        23.45
Nathan DeBardeleben   2      490        31.71
Sean Blanchard        3      190        13.20
Kurt Ferreira         4      639        40.78
Jon Stearley          5      651        24.52
John Shalf            6      2353       211.77
Sudhanva Gurumurthi   7      1232       78.23