Title
Energy Analysis and Optimization for Resilient Scalable Linear Systems
Abstract
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize costs of resilience techniques including checkpoint-restart and forward recovery for large sparse linear system solvers. In particular, we present experimental and analytical methods to analyze and quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and optimization techniques significantly reduce such overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget.
Year
2018
DOI
10.1109/CLUSTER.2018.00015
Venue
2018 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
Resilience, Energy-Efficiency, Forward Recovery, HPC
Field
Power budget, Psychological resilience, Exascale computing, Power management, Computer science, Efficient energy use, Computer cluster, Distributed computing, Overhead (business), Scalability
DocType
Conference
ISSN
1552-5244
ISBN
978-1-5386-8320-0
Citations
0
PageRank
0.34
References
26
Authors
3
Name
Order
Citations
PageRank
Zheng Miao110.68
Jon Calhoun2474.75
Ge, Rong3111978.72