Abstract | ||
---|---|---|
Systematic checkpointing of the machine state makes roll-back and restart of execution from a safe state possible upon detection of an error. The energy overhead of checkpointing, however, as incurred by storage and communication of the machine state, grows with the frequency of checkpointing. Amortizing this overhead becomes especially challenging because expected error rates grow as an artifact of contemporary technology scaling, and checkpointing frequency tends to increase with increasing error rates. At the same time, due to imbalances in technology scaling, recomputing data can become more energy efficient than storing and retrieving precomputed data. Recomputation of data (which otherwise would be read from a checkpoint) can reduce the frequency of checkpointing along with the data size to be checkpointed, and thereby mitigate checkpointing overhead. This paper quantitatively characterizes a recomputation-enabled checkpointing framework which can reduce the storage overhead of checkpointing by up to 23.91%, the performance overhead by up to 11.92%, and the energy overhead by up to 12.53%. |
Year | Venue | Field |
---|---|---|
2017 | arXiv: Distributed, Parallel, and Cluster Computing | Technology scaling, Computer science, Efficient energy use, Amortizing loan, Distributed computing |
DocType | Volume | Citations |
Journal | abs/1710.04685 | 0 |
PageRank | References | Authors |
0.34 | 10 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ismail Akturk | 1 | 32 | 6.56 |
Ulya R. Karpuzcu | 2 | 277 | 22.27 |