Abstract | ||
---|---|---|
Compute node failures are becoming a normal event for many long-running, scalable MPI applications. Staying within the MPI standards and building on existing fault-tolerance methods, we developed a methodology that lets applications tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To this end, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and the functionality in case of a node failure. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/TPDS.2020.3015615 | IEEE Transactions on Parallel and Distributed Systems |
Keywords | DocType | Volume |
Fault tolerance, checkpoint-restart libraries, MPI, checkpoint scalability | Journal | 32 |
Issue | ISSN | Citations |
2 | 1045-9219 | 0 |
PageRank | References | Authors |
0.34 | 0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Alvaro Wong | 1 | 37 | 8.11 |
Elisa Heymann | 2 | 108 | 13.21 |
Dolores Rexachs | 3 | 195 | 43.20 |
Emilio Luque | 4 | 1097 | 176.18 |