Title
ACR: Amnesic Checkpointing and Recovery
Abstract
Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Considering the growth of expected error rates, amortizing this overhead becomes especially challenging, as checkpointing frequency tends to increase with increasing error rates. Based on the observation that, due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which otherwise would be read from a checkpoint in memory or secondary storage) can reduce the machine state to be checkpointed, and thereby the checkpointing overhead. Even in a relatively small scale system, recomputation-based checkpointing can reduce the storage overhead by up to 23.91%, the time overhead by 11.92%, and the energy overhead by 12.53%.
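The core idea in the abstract, omitting from the checkpoint any value that can be recomputed during recovery, can be illustrated with a toy sketch. This is not the paper's mechanism, which the abstract does not detail. The names `take_checkpoint`, `recover`, and `recipes` are hypothetical, and the sketch assumes every recomputable value depends only on values that are themselves stored in the checkpoint.

```python
from typing import Callable, Dict, Tuple

# Hypothetical "recipe": a function plus the names of the stored inputs it needs.
Recipe = Tuple[Callable[..., int], Tuple[str, ...]]


def take_checkpoint(state: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Store only the values that have no recomputation recipe."""
    return {name: value for name, value in state.items() if name not in recipes}


def recover(checkpoint: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Restore the stored values, then recompute the omitted ones from them."""
    state = dict(checkpoint)
    for name, (fn, inputs) in recipes.items():
        # Assumption: inputs are always stored values, never other recomputed ones.
        state[name] = fn(*(checkpoint[i] for i in inputs))
    return state


state = {"a": 3, "b": 4, "sum": 7, "prod": 12}
recipes = {
    "sum": (lambda a, b: a + b, ("a", "b")),
    "prod": (lambda a, b: a * b, ("a", "b")),
}

checkpoint = take_checkpoint(state, recipes)  # only "a" and "b" are stored
restored = recover(checkpoint, recipes)       # "sum" and "prod" are recomputed
```

In this toy case the checkpoint holds two of the four values, and recovery trades the omitted loads for two arithmetic operations. Whether that trade pays off in a real system depends on the relative cost of recomputation and retrieval, which is the question the paper evaluates.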
Year
2020
DOI
10.1109/HPCA47549.2020.00013
Venue
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Keywords
checkpointing, recovery, recomputation
DocType
Conference
ISSN
1530-0897
ISBN
978-1-7281-6150-1
Citations
1
PageRank
0.43
References
21
Authors
2
Name                Order  Citations  PageRank
Ismail Akturk       1      32         6.56
Ulya R. Karpuzcu    2      277        22.27