Abstract |
---|
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches for dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer, and demonstrate its performance on very large scale supercomputers. For example, when running a one-million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores. |
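
The double in-memory checkpoint and restart idea named in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' Charm++ or MPI implementation: the `Node` class, the ring-style `buddy_of` pairing, and the `checkpoint` and `restart` helpers are hypothetical names chosen for illustration, and the sketch assumes at least two nodes and a single failure at a time.

```python
import copy


class Node:
    """Hypothetical stand-in for one processor's memory."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {}         # live application state
        self.local_ckpt = None  # this node's own checkpoint
        self.buddy_ckpt = None  # checkpoint held on behalf of another node


def buddy_of(rank, n):
    # Assumed pairing: each node's second copy lives on the next rank.
    return (rank + 1) % n


def checkpoint(nodes):
    """Store every node's state twice: locally and in its buddy's memory."""
    n = len(nodes)
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        nodes[buddy_of(node.rank, n)].buddy_ckpt = copy.deepcopy(snapshot)


def restart(nodes, failed_rank):
    """Replace a failed node and roll every node back to the checkpoint."""
    n = len(nodes)
    replacement = Node(failed_rank)
    # The failed node's memory is lost, so recover from the buddy's copy.
    buddy = nodes[buddy_of(failed_rank, n)]
    replacement.state = copy.deepcopy(buddy.buddy_ckpt)
    replacement.local_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    # The failed node also held its predecessor's second copy; rebuild it.
    predecessor = nodes[(failed_rank - 1) % n]
    replacement.buddy_ckpt = copy.deepcopy(predecessor.local_ckpt)
    nodes[failed_rank] = replacement
    # Surviving nodes roll back from their own local copies.
    for node in nodes:
        if node.rank != failed_rank:
            node.state = copy.deepcopy(node.local_ckpt)
    return nodes
```

Because both copies stay in memory, neither checkpointing nor recovery touches disk in this sketch; the price is that each node's memory must hold its live state plus two snapshots.
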
Year | DOI | Venue |
---|---|---|
2012 | 10.1109/DSNW.2012.6264677 | DSN Workshops |
Keywords | Field | DocType |
parallel processing, application program interfaces, checkpointing, fault tolerant computing, mainframes, MPI communication layer, exascale, very large scale supercomputers, double in-memory checkpointing scheme, restart scheme, message passing, checkpoint-based fault tolerance methods, parallel application, optimization, fault tolerance, protocols | Computer science, Parallel computing, Parallel processing, Communication layer, Real-time computing, Fault tolerance, Message passing, Scalability, Distributed computing | Conference |
ISBN | Citations | PageRank |
978-1-4673-2265-2 | 43 | 1.29 |
References | Authors | |
11 | 3 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Gengbin Zheng | 1 | 829 | 55.03 |
Xiang Ni | 2 | 141 | 6.58 |
Laxmikant V. Kale | 3 | 2871 | 248.18 |