Evaluating cooperative checkpointing for supercomputing systems - Citegraph

Paper Info

Title
Evaluating cooperative checkpointing for supercomputing systems

Abstract
Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads.
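
The two policies named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the decision rules, the `CheckpointRequest` type, and the function names are assumptions chosen only to show the idea of a system that grants or skips an application's checkpoint request. The work-based rule here performs a checkpoint when the unsaved work is at least as large as the checkpoint overhead; the risk-based rule weights the unsaved work by a predicted failure probability first.

```python
from dataclasses import dataclass


@dataclass
class CheckpointRequest:
    """Hypothetical description of one application checkpoint request."""
    unsaved_work: float         # seconds of work done since the last saved checkpoint
    checkpoint_overhead: float  # seconds needed to write this checkpoint


def work_based_decision(req: CheckpointRequest) -> bool:
    """Assumed work-based rule: checkpoint only if the work at risk
    is at least as large as the cost of saving it."""
    return req.unsaved_work >= req.checkpoint_overhead


def risk_based_decision(req: CheckpointRequest, failure_probability: float) -> bool:
    """Assumed risk-based rule: checkpoint only if the expected lost work
    (predicted failure probability times unsaved work) covers the overhead."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be in [0, 1]")
    return failure_probability * req.unsaved_work >= req.checkpoint_overhead


if __name__ == "__main__":
    # One hour of unsaved work, twelve minutes to write a checkpoint.
    req = CheckpointRequest(unsaved_work=3600.0, checkpoint_overhead=720.0)
    print("work-based:", work_based_decision(req))
    print("risk-based, low risk:", risk_based_decision(req, 0.1))
    print("risk-based, high risk:", risk_based_decision(req, 0.5))
```

Under these assumptions, a low predicted failure probability lets the system skip a checkpoint that the work-based rule would have performed, which is the kind of system-level information the abstract refers to.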

Year	DOI	Venue
2006	10.1109/IPDPS.2006.1639693	Rhodes Island
Keywords	Field	DocType
network topology,application software,computer science,system performance	Heuristic,Supercomputer,Computer science,Parallel computing,Exploit,Network topology,Application software,Bounded function,Distributed computing,Overhead (business)	Conference
ISBN	Citations	PageRank
1-4244-0054-6	7	0.59
References	Authors
14	2

Authors (2 rows)

Name	Order	Citations	PageRank
Adam J. Oliner	1	715	51.10
Ramendra K. Sahoo	2	633	56.73
