Abstract |
---|
Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper, we suggest going one step further and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically, we study the feasibility of local recovery for stencil-based parallel applications, and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1145/2749246.2749260 | High-Performance Distributed Computing |
Field | DocType | Citations |
---|---|---|
Psychological resilience,Computer science,Stencil,Parallel computing,Real-time computing,Fault tolerance,Distributed computing,Overhead (business) | Conference | 5 |
PageRank | References | Authors |
---|---|---|
0.42 | 12 | 7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Marc Gamell | 1 | 92 | 5.70 |
Keita Teranishi | 2 | 49 | 6.30 |
Michael A. Heroux | 3 | 974 | 69.20 |
Jackson Mayo | 4 | 43 | 7.97 |
Hemanth Kolla | 5 | 250 | 17.13 |
Jacqueline Chen | 6 | 240 | 13.69 |
Manish Parashar | 7 | 3876 | 343.30 |