Title | ||
---|---|---|
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++ |
Abstract | ||
---|---|---|
As the size of high-performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective approach to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance. |
Year | DOI | Venue |
---|---|---|
2006 | 10.1145/1131322.1131340 | Operating Systems Review |
Keywords | Field | DocType |
---|---|---|
application developer,automatic checkpoint-based fault tolerance,effective approach,significant challenge,memory-based checkpointing fault tolerance,checkpoint-based fault tolerance method,different number,adaptive mpi,performance evaluation,significant additional code,high performance clusters multiplies,entire parallel application,application development,fault tolerant | Computer science,Parallel computing,Real-time computing,Fault tolerance,Message Passing Interface,Message passing,Distributed computing,Scalability | Journal |
Volume | Issue | Citations |
---|---|---|
40 | 2 | 14 |
PageRank | References | Authors |
---|---|---|
0.84 | 13 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Gengbin Zheng | 1 | 829 | 55.03 |
Chao Huang | 2 | 154 | 10.09 |
Laxmikant V. Kale | 3 | 2871 | 248.18 |