Title |
---|
Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales |
Abstract |
---|
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online ... |
Year | DOI | Venue |
---|---|---|
2017 | 10.1109/TPDS.2017.2696538 | IEEE Transactions on Parallel and Distributed Systems |
Keywords | Field | DocType |
Computational modeling, Delays, Protocols, Resilience, Fault tolerance, Fault tolerant systems, Hardware | Masking (art), Computer science, Stencil, Parallel processing, Stencil code, Failure rate, Real-time computing, Fault tolerance, Scalability, Computation, Distributed computing | Journal |
Volume | Issue | ISSN |
28 | 10 | 1045-9219 |
Citations | PageRank | References |
4 | 0.39 | 28 |
Authors |
---|
7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Marc Gamell | 1 | 92 | 5.70 |
Keita Teranishi | 2 | 49 | 6.30 |
Jackson Mayo | 3 | 43 | 7.97 |
Hemanth Kolla | 4 | 250 | 17.13 |
Michael A. Heroux | 5 | 974 | 69.20 |
Jacqueline H Chen | 6 | 181 | 11.19 |
Manish Parashar | 7 | 3876 | 343.30 |