Abstract |
---|
Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate. |
Year | DOI | Venue |
---|---|---|
2013 | 10.1145/2503210.2503257 | High Performance Computing, Networking, Storage and Analysis |
Keywords | Field | DocType |
---|---|---|
dram device,higher fault rate,positional effect,high-performance computing system,computing system,feng shui,supercomputer memory,large high-performance computing system,dram lifetime,sram fault,substantial impact,dram vendor,fault rate,memory,phase change | Dram,Supercomputer,Computer science,Parallel computing,Fault rate,Universal memory,Static random-access memory,Data center,CAS latency,Computing systems,Embedded system | Conference |
ISSN | ISBN | Citations |
---|---|---|
2167-4329 | 978-1-4503-2378-9 | 69 |
PageRank | References | Authors |
---|---|---|
1.91 | 13 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Vilas Sridharan | 1 | 512 | 23.45 |
Jon Stearley | 2 | 651 | 24.52 |
Nathan DeBardeleben | 3 | 490 | 31.71 |
Sean Blanchard | 4 | 190 | 13.20 |
Sudhanva Gurumurthi | 5 | 1232 | 78.23 |