Abstract |
---|
Exascale computing must address energy efficiency and resilience simultaneously, as power limits constrain scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically harms the other. To deliver the promised performance within a given power budget, exascale computing requires a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of resilience techniques, including checkpoint-restart and forward recovery, for large sparse linear system solvers. In particular, we present experimental and analytical methods to quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and that optimization techniques significantly reduce these overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget. |
Year | DOI | Venue |
---|---|---|
2018 | 10.1109/CLUSTER.2018.00015 | 2018 IEEE International Conference on Cluster Computing (CLUSTER) |
Keywords | Field | DocType |
Resilience, Energy-Efficiency, Forward Recovery, HPC | Power budget, Psychological resilience, Exascale computing, Power management, Computer science, Efficient energy use, Computer cluster, Distributed computing, Overhead (business), Scalability | Conference |
ISSN | ISBN | Citations |
1552-5244 | 978-1-5386-8320-0 | 0 |
PageRank | References | Authors |
0.34 | 26 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Zheng Miao | 1 | 1 | 0.68 |
Jon Calhoun | 2 | 47 | 4.75 |
Rong Ge | 3 | 1119 | 78.72 |