Title
Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems
Abstract
To maintain strong reliability, memory manufacturers label server memories at much slower data rates than the highest data rates at which they can still operate correctly for most (e.g., 99.999%+ of) accesses; we refer to the gap between these two data rates as memory frequency margin. While many prior works have studied memory latency margins in a different context of consumer memories, none has publicly studied memory frequency margin (either for consumer or server memories).To close this knowledge gap in the public domain, we perform the first public study to characterize frequency margins in commodity server memory modules. Through our large-scale study, we find that under standard voltage and cooling, they can operate 27% faster, on average, without error(s) for 99.999%+ of accesses even at high temperatures.The current practice of conservatively operating server memory is far from ideal; it slows down 99.999%+ of accesses to benefit the <0.001% of accesses that would be erroneous at a faster data rate. An ideal system should only pay this reliability tax for the <0.001% of accesses that actually need it.Towards unleashing ideal performance, our second contribution is performing the first exploration on exploiting server memory frequency margin to maximize performance. We focus on High-Performance Computing (HPC) systems, where performance is paramount. We propose exploiting HPC systems’ abundant free memory in the common case to store copies of every data block and operate the copies unreliably fast to speedup common-case accesses; we use the safely-operated original blocks for recovery when the unsafely-operated copies become corrupted. We refer to our idea as Heterogeneously-accessed Dual Module Redundancy (Hetero-DMR).Hetero-DMR improves node-level performance by 18%, on average across two CPU memory hierarchies and six HPC benchmark suites, while weighted by different frequency margins and different levels of memory utilization. We also use a real system to emulate the speedup of Hetero-DMR over a conventional system; it closely matches simulation. Our system-wide simulations show applying Hetero-DMR to an HPC system provides 1.4x average speedup on job turnaround time. To facilitate adoption, Hetero-DMR also rigorously preserves system reliability and works for commodity DIMMs and CPU-memory interfaces.
Year
DOI
Venue
2021
10.1109/ISCA52012.2021.00064
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
Keywords
DocType
ISSN
Memory Frequency Margin,Memory System,Fault Tolerance,Reliability,Availability,HPC
Conference
1063-6897
ISBN
Citations 
PageRank 
978-1-6654-3334-1
1
0.35
References 
Authors
0
6
Name
Order
Citations
PageRank
Da Zhang1122.25
Gagandeep Panwar210.69
Jagadish Kotra3608.00
Nathan DeBardeleben449031.71
Sean Blanchard519013.20
Xun Jian6666.08