Title
Replication is more efficient than you think
Abstract
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published work uses replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use the checkpointing period $T_{\text{MTTI}}^{\text{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, in which failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}^{\text{rs}}$ for this strategy, which is much larger than $T_{\text{MTTI}}^{\text{no}}$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}^{\text{rs}}$ and the restart strategy, instead of $T_{\text{MTTI}}^{\text{no}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
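Steps (i) and (ii) of the no-restart strategy can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give the formula for M, so the sketch estimates it by Monte Carlo simulation under the assumption that processor failures are independent and exponentially distributed. The function names `young_daly_period` and `estimate_mtti_no_restart` are hypothetical, chosen for illustration only.

```python
import math
import random


def young_daly_period(mtti, checkpoint_cost):
    """First-order Young/Daly checkpointing period: sqrt(2 * M * C)."""
    return math.sqrt(2.0 * mtti * checkpoint_cost)


def estimate_mtti_no_restart(num_pairs, mtbf, trials=20000, seed=0):
    """Monte Carlo estimate of the MTTI M under the no-restart strategy.

    Assumption (not stated in the abstract): processor failures are
    independent and exponentially distributed with mean `mtbf`.
    """
    rng = random.Random(seed)
    rate = 1.0 / mtbf
    total = 0.0
    for _ in range(trials):
        # A pair is lost when its second replica fails (max of two lifetimes);
        # the application is interrupted when the first pair is lost (min).
        total += min(
            max(rng.expovariate(rate), rng.expovariate(rate))
            for _ in range(num_pairs)
        )
    return total / trials


if __name__ == "__main__":
    # Illustrative values only: 100 processor pairs, MTBF of 1000 hours,
    # checkpoint duration of 0.1 hours.
    m = estimate_mtti_no_restart(num_pairs=100, mtbf=1000.0, trials=2000)
    print("estimated MTTI (hours):", m)
    print("Young/Daly period (hours):", young_daly_period(m, 0.1))
```

The simulation stands in for a closed-form MTTI, which would be preferable for large platforms since the simulation cost grows with the number of pairs and trials.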
Year: 2019
DOI: 10.1145/3295500.3356171
Venue: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DocType: Conference
ISBN: 978-1-4503-6229-0
Citations: 1
PageRank: 0.36
References: 0
Authors: 4
Name               Order  Citations  PageRank
Anne Benoit        1      342        33.74
Thomas Herault     2      507        40.06
Valentin Le Fèvre  3      7          2.13
Yves Robert        4      842        70.03