Title
A scalable asynchronous replication-based strategy for fault tolerant MPI applications
Abstract
As computational clusters increase in size, their mean-time-to-failure reduces. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark with tests up to 256 nodes, while demonstrating that checkpointing and replication can be achieved with much lower overhead than that provided by current techniques.
Year
2007
DOI
10.1007/978-3-540-77220-0_26
Venue
HiPC
Keywords
replication scheme, central storage, checkpointing technique, mpi checkpointing facility, centralized storage, replication-based strategy, asynchronous replication, scalable asynchronous, network storage, fault tolerant mpi application, lower overhead, low overhead, fault-tolerant mpi, mean time to failure, fault tolerant
Field
Asynchronous communication, File system, Network storage, Computer science, Parallel computing, Fault tolerance, Computation, Scalability, Distributed computing
DocType
Conference
Volume
4873
ISSN
0302-9743
ISBN
3-540-77219-7
Citations
5
PageRank
0.44
References
17
Authors
2
Name, Order, Citations, PageRank
John Paul Walters, 1, 267, 20.45
Vipin Chaudhary, 2, 838, 83.24