Title
Exploring Failure Recovery for Stencil-based Applications at Extreme Scales
Abstract
Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper, we suggest going one step further and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically, we study the feasibility of local recovery for stencil-based parallel applications, and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
Year: 2015
DOI: 10.1145/2749246.2749260
Venue: High-Performance Distributed Computing
Field: Psychological resilience, Computer science, Stencil, Parallel computing, Real-time computing, Fault tolerance, Distributed computing, Overhead (business)
DocType: Conference
Citations: 5
PageRank: 0.42
References: 12
Authors: 7
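| Name | Order | Citations | PageRank |
| --- | --- | --- | --- |
| Marc Gamell | 1 | 92 | 5.70 |
| Keita Teranishi | 2 | 49 | 6.30 |
| Michael A. Heroux | 3 | 974 | 69.20 |
| Jackson Mayo | 4 | 43 | 7.97 |
| Hemanth Kolla | 5 | 250 | 17.13 |
| Jacqueline Chen | 6 | 240 | 13.69 |
| Manish Parashar | 7 | 3876 | 343.30 |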