Title
Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales.
Abstract
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online ...
Year: 2017
DOI: 10.1109/TPDS.2017.2696538
Venue: IEEE Transactions on Parallel and Distributed Systems
Keywords: Computational modeling, Delays, Protocols, Resilience, Fault tolerance, Fault tolerant systems, Hardware
Field: Masking (art), Computer science, Stencil, Parallel processing, Stencil code, Failure rate, Real-time computing, Fault tolerance, Scalability, Computation, Distributed computing
DocType: Journal
Volume: 28
Issue: 10
ISSN: 1045-9219
Citations: 4
PageRank: 0.39
References: 28
Authors (7)
Name               Order  Citations  PageRank
Marc Gamell        1      92         5.70
Keita Teranishi    2      49         6.30
Jackson Mayo       3      43         7.97
Hemanth Kolla      4      250        17.13
Michael A. Heroux  5      974        69.20
Jacqueline H Chen  6      181        11.19
Manish Parashar    7      3876       343.30