A scalable double in-memory checkpoint and restart scheme towards exascale - Citegraph

Paper Info

Title
A scalable double in-memory checkpoint and restart scheme towards exascale

Abstract
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are an effective way of dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer, and demonstrate its performance on very large-scale supercomputers. For example, when running a one-million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores.

Year	2012
DOI	10.1109/DSNW.2012.6264677
Venue	DSN Workshops
Keywords	parallel processing, application program interfaces, checkpointing, fault tolerant computing, mainframes, mpi communication layer, exascale, very large scale supercomputers, double in-memory checkpointing scheme, restart scheme, message passing, checkpoint-based fault tolerance methods, parallel application, optimization, fault tolerance, protocols
Field	Computer science, Parallel computing, Parallel processing, Communication layer, Real-time computing, Fault tolerance, Message passing, Scalability, Distributed computing
DocType	Conference
ISBN	978-1-4673-2265-2
Citations	43
PageRank	1.29
References	11
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Gengbin Zheng	1	829	55.03
Xiang Ni	2	141	6.58
Laxmikant V. Kale	3	2871	248.18
