Replication is more efficient than you think - Citegraph

Paper Info

Title
Replication is more efficient than you think

Abstract
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T_{MTTI}^{no} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{opt}^{rs}$ for this strategy, which is much larger than $T_{MTTI}^{no}$, thereby decreasing I/O pressure. We show through simulations that using $T_{opt}^{rs}$ and the restart strategy, instead of $T_{MTTI}^{no}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
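
The Young/Daly rule referred to in step (ii) is commonly written as the first-order approximation sqrt(2 * M * C). The sketch below only illustrates that arithmetic. The function name `young_daly_period` and the numeric values are hypothetical and are not taken from the paper, and the MTTI M is treated as a given input rather than derived from the number of processor pairs.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return the first-order Young/Daly checkpointing period sqrt(2 * M * C).

    mtti            -- application Mean Time To Interruption M, in seconds
    checkpoint_cost -- checkpoint duration C, in seconds
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


# Hypothetical inputs (assumptions, not from the paper):
# M = 10 hours = 36,000 s and C = 60 s.
period = young_daly_period(mtti=36_000.0, checkpoint_cost=60.0)
print(f"checkpoint every {period:.0f} s")
```

Because the period grows with the square root of M, quadrupling the MTTI only doubles the interval between checkpoints under this rule.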

Year: 2019
DOI: 10.1145/3295500.3356171
Venue: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DocType: Conference
ISBN: 978-1-4503-6229-0
Citations: 1
PageRank: 0.36
References: 0
Authors: 4

Authors (4 rows)

Name	Order	Citations	PageRank
Anne Benoit	1	342	33.74
Thomas Herault	2	507	40.06
Valentin Le Fèvre	3	7	2.13
Yves Robert	4	842	70.03
