Title
Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints
Abstract
Compute node failures are becoming a normal event for many long-running, scalable MPI applications. Staying within the MPI standard and building on existing fault tolerance methods, we developed a methodology that allows applications to tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To this end, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and to verify functionality in the event of a node failure. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints.
Year: 2021
DOI: 10.1109/TPDS.2020.3015615
Venue: IEEE Transactions on Parallel and Distributed Systems
Keywords: Fault tolerance, checkpoint-restart libraries, MPI, checkpoint scalability
DocType: Journal
Volume: 32
Issue: 2
ISSN: 1045-9219
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name             Order  Citations  PageRank
Alvaro Wong      1      37         8.11
Elisa Heymann    2      108        13.21
Dolores Rexachs  3      195        43.20
Emilio Luque     4      1097       176.18