Title
Cooperative checkpointing: a robust approach to large-scale systems reliability
Abstract
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.
Year
2006
DOI
10.1145/1183401.1183406
Venue
I4CS
Keywords
qos guarantee,robust approach,large-scale systems reliability,simulation-based experimental analysis,robust system,failure distribution,fault-aware job scheduling,periodic checkpointing,application-initiated checkpointing mechanism,cooperative checkpointing,theoretical prediction,reliability technique,parallel computer,experimental analysis,high performance computing,supercomputing,job scheduling,parallel computing
Field
Supercomputer,Computer science,Parallel computing,Quality of service,Robustness (computer science),Real-time computing,Job scheduler,Probabilistic logic,Periodic graph (geometry),Distributed computing
DocType
Conference
ISBN
1-59593-282-8
Citations
31
PageRank
2.03
References
22
Authors
3
Name
Order
Citations
PageRank
Adam J. Oliner171551.10
Larry Rudolph216815.54
Ramendra K. Sahoo363356.73