Energy Analysis and Optimization for Resilient Scalable Linear Systems - Citegraph

Paper Info

Title
Energy Analysis and Optimization for Resilient Scalable Linear Systems

Abstract
Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize costs of resilience techniques including checkpoint-restart and forward recovery for large sparse linear system solvers. In particular, we present experimental and analytical methods to analyze and quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and optimization techniques significantly reduce such overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget.

Year	2018
DOI	10.1109/CLUSTER.2018.00015
Venue	2018 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords	Resilience, Energy-Efficiency, Forward Recovery, HPC
Field	Power budget, Psychological resilience, Exascale computing, Power management, Computer science, Efficient energy use, Computer cluster, Distributed computing, Overhead (business), Scalability
DocType	Conference
ISSN	1552-5244
ISBN	978-1-5386-8320-0
Citations	0
PageRank	0.34
References	26
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Zheng Miao	1	1	0.68
Jon Calhoun	2	47	4.75
Rong Ge	3	1119	78.72
