Abstract |
---|
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T_{\text{MTTI}}^{\text{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}^{\text{rs}}$ for this strategy, which is much larger than $T_{\text{MTTI}}^{\text{no}}$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}^{\text{rs}}$ and the restart strategy, instead of $T_{\text{MTTI}}^{\text{no}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
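
Step (ii) of the no-restart strategy can be illustrated with a minimal sketch of the first-order Young/Daly period, sqrt(2·M·C). The sketch assumes the MTTI M has already been computed in step (i); the function name `young_daly_period` and the numeric values are hypothetical and chosen only for illustration. It does not compute the optimal period for the restart strategy, whose formula the abstract does not state.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return the first-order Young/Daly checkpointing period sqrt(2 * M * C).

    mtti            -- application Mean Time To Interruption M (seconds)
    checkpoint_cost -- checkpoint duration C (seconds)
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


# Hypothetical inputs (not taken from the paper): M = 20000 s, C = 100 s.
M = 20000.0
C = 100.0
period = young_daly_period(M, C)
print(f"Checkpoint every {period:.0f} s")  # sqrt(2 * 20000 * 100) = 2000
```

Because the period grows only as the square root of M, quadrupling the MTTI merely doubles the time between checkpoints.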
Field | Value |
---|---|
Year | 2019 |
DOI | 10.1145/3295500.3356171 |
Venue | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis |
DocType | Conference |
ISBN | 978-1-4503-6229-0 |
Citations | 1 |
PageRank | 0.36 |
References | 0 |
Authors | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Anne Benoit | 1 | 342 | 33.74 |
Thomas Herault | 2 | 507 | 40.06 |
Valentin Le Fèvre | 3 | 7 | 2.13 |
Yves Robert | 4 | 842 | 70.03 |