Abstract |
---|
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing. |
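
The core idea in the abstract, that an application requests checkpoints and the system may skip them at runtime, can be illustrated with a small discrete-time simulation. This is only a sketch under stated assumptions and not the authors' simulator or policy. The names `simulate`, `periodic`, and `at_risk_threshold` are hypothetical, as is the rule of granting a checkpoint only when enough unsaved work is at risk. It further assumes that a failure costs one time unit, rolls the job back to its last completed checkpoint, and voids any checkpoint in progress.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical gatekeeper signature: (current time, unsaved work) -> grant?
Gatekeeper = Callable[[int, int], bool]


def simulate(total_work: int, interval: int, checkpoint_cost: int,
             failure_times: Iterable[int], grant: Gatekeeper) -> Tuple[int, int]:
    """Return (completion time, checkpoints performed) for one job."""
    failures = sorted(set(failure_times))
    clock = done = saved = performed = 0
    while done < total_work:
        clock += 1
        if clock in failures:
            done = saved              # roll back to the last checkpoint
            continue
        done += 1
        # The application requests a checkpoint every `interval` work units.
        if done < total_work and done % interval == 0:
            if grant(clock, done - saved):
                end = clock + checkpoint_cost
                hit = [t for t in failures if clock < t <= end]
                if hit:               # failure during the checkpoint voids it
                    clock = hit[0]
                    done = saved
                else:
                    clock = end
                    saved = done
                    performed += 1
            # A skipped request costs nothing and leaves the work unsaved.
    return clock, performed


def periodic(clock: int, unsaved: int) -> bool:
    """Grant every request, which reduces to periodic checkpointing."""
    return True


def at_risk_threshold(threshold: int) -> Gatekeeper:
    """Illustrative policy: grant only if enough unsaved work is at risk."""
    def grant(clock: int, unsaved: int) -> bool:
        return unsaved >= threshold
    return grant
```

Calling `simulate(100, 10, 5, [], periodic)` and `simulate(100, 10, 5, [], at_risk_threshold(30))` compares the two policies on a failure-free run; passing a list of failure times shows how each policy trades checkpoint overhead against lost work.
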
Year | DOI | Venue |
---|---|---|
2006 | 10.1145/1183401.1183406 | ICS |
Keywords | Field | DocType |
---|---|---|
qos guarantee,robust approach,large-scale systems reliability,simulation-based experimental analysis,robust system,failure distribution,fault-aware job scheduling,periodic checkpointing,application-initiated checkpointing mechanism,cooperative checkpointing,theoretical prediction,reliability technique,parallel computer,experimental analysis,high performance computing,supercomputing,job scheduling,parallel computing | Supercomputer,Computer science,Parallel computing,Quality of service,Robustness (computer science),Real-time computing,Job scheduler,Probabilistic logic,Periodic graph (geometry),Distributed computing | Conference |
ISBN | Citations | PageRank |
---|---|---|
1-59593-282-8 | 31 | 2.03 |
References | Authors |
---|---|
22 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Adam J. Oliner | 1 | 715 | 51.10 |
Larry Rudolph | 2 | 168 | 15.54 |
Ramendra K. Sahoo | 3 | 633 | 56.73 |