Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems - Citegraph

Paper Info

Title
Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems

Abstract
To maintain strong reliability, memory manufacturers label server memories at much slower data rates than the highest data rates at which they can still operate correctly for most (e.g., 99.999%+ of) accesses; we refer to the gap between these two data rates as memory frequency margin. While many prior works have studied memory latency margins in a different context of consumer memories, none has publicly studied memory frequency margin (either for consumer or server memories).
To close this knowledge gap in the public domain, we perform the first public study to characterize frequency margins in commodity server memory modules. Through our large-scale study, we find that under standard voltage and cooling, they can operate 27% faster, on average, without error(s) for 99.999%+ of accesses even at high temperatures.
The current practice of conservatively operating server memory is far from ideal; it slows down 99.999%+ of accesses to benefit the <0.001% of accesses that would be erroneous at a faster data rate. An ideal system should only pay this reliability tax for the <0.001% of accesses that actually need it.
Towards unleashing ideal performance, our second contribution is performing the first exploration on exploiting server memory frequency margin to maximize performance. We focus on High-Performance Computing (HPC) systems, where performance is paramount. We propose exploiting HPC systems’ abundant free memory in the common case to store copies of every data block and operate the copies unreliably fast to speed up common-case accesses; we use the safely-operated original blocks for recovery when the unsafely-operated copies become corrupted. We refer to our idea as Heterogeneously-accessed Dual Module Redundancy (Hetero-DMR).
Hetero-DMR improves node-level performance by 18%, on average across two CPU memory hierarchies and six HPC benchmark suites, while weighted by different frequency margins and different levels of memory utilization.
We also use a real system to emulate the speedup of Hetero-DMR over a conventional system; it closely matches simulation. Our system-wide simulations show applying Hetero-DMR to an HPC system provides 1.4x average speedup on job turnaround time. To facilitate adoption, Hetero-DMR also rigorously preserves system reliability and works for commodity DIMMs and CPU-memory interfaces.
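The read path the abstract describes (serve common-case reads from a fast but unreliable copy, and fall back to the safely operated original when the copy is corrupted) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class `DualCopyMemory`, its method names, and the use of a CRC32 checksum to detect a corrupted copy are all hypothetical choices made for illustration; the abstract does not say how corruption is detected.

```python
import zlib


class DualCopyMemory:
    """Toy model of the dual-copy idea: a safe original plus a fast copy.

    Assumption (not from the abstract): a CRC32 stored alongside the fast
    copy stands in for whatever error detection a real system would use.
    """

    def __init__(self):
        self.safe = {}        # addr -> data, operated at the labeled data rate
        self.fast = {}        # addr -> (data, crc), operated beyond the label
        self.recoveries = 0   # number of reads that fell back to the original

    def write(self, addr, data: bytes):
        # Every block is stored twice: the original and the fast copy.
        self.safe[addr] = data
        self.fast[addr] = (data, zlib.crc32(data))

    def corrupt_fast_copy(self, addr):
        # Test helper: flip one bit in the fast copy to mimic a margin error.
        data, crc = self.fast[addr]
        flipped = bytes([data[0] ^ 0x01]) + data[1:]
        self.fast[addr] = (flipped, crc)

    def read(self, addr) -> bytes:
        data, crc = self.fast[addr]
        if zlib.crc32(data) == crc:
            return data                      # common case: fast copy is intact
        # Rare case: recover from the safe original and repair the fast copy.
        self.recoveries += 1
        good = self.safe[addr]
        self.fast[addr] = (good, zlib.crc32(good))
        return good


mem = DualCopyMemory()
mem.write(0x10, b"payload")
first = mem.read(0x10)          # served from the fast copy
mem.corrupt_fast_copy(0x10)
second = mem.read(0x10)         # detected as corrupt, served from the original
```

In this model only the rare corrupted read pays the cost of going to the original, which mirrors the abstract's point that the reliability tax should apply only to the accesses that need it.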

Year	2021
DOI	10.1109/ISCA52012.2021.00064
Venue	2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
Keywords	Memory Frequency Margin, Memory System, Fault Tolerance, Reliability, Availability, HPC
DocType	Conference
ISSN	1063-6897
ISBN	978-1-6654-3334-1
Citations	1
PageRank	0.35
References	0
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Da Zhang	1	12	2.25
Gagandeep Panwar	2	1	0.69
Jagadish Kotra	3	60	8.00
Nathan DeBardeleben	4	490	31.71
Sean Blanchard	5	190	13.20
Xun Jian	6	66	6.08