Title |
---|
Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability |
Abstract |
---|
Checkpointing is the most widely used technique in high-performance computing (HPC) for ensuring application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems that quantify their reliability and availability. We perform a trade-off analysis between resource costs and reliability. We then determine the optimal checkpoint placement that maximizes system availability. Finally, we rigorously compare the behavior of redundant systems that employ replication and repair mechanisms. We expect the proposed models to aid system designers, who can instantiate them to assess and quantify the availability and reliability of the systems they are interested in. |
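The abstract does not state the models' equations. As a hedged illustration of the checkpoint-placement trade-off it describes (frequent checkpoints add overhead, infrequent ones lose more work per failure), the following sketch uses two standard textbook models rather than the authors' Markov models: the expected time to complete a work segment under exponentially distributed failures, and Young's first-order approximation of the optimal interval, τ* ≈ √(2CM). The checkpoint cost C, mean time between failures M, restart time R, and the use of the useful-work fraction as an availability proxy are all assumptions introduced for illustration.

```python
import math

# Assumed model (not the paper's): failures arrive as a Poisson process with
# mean time between failures M; a failure loses the work done since the last
# checkpoint and costs a failure-free restart time R.

def expected_segment_time(tau: float, C: float, M: float, R: float) -> float:
    """Expected wall-clock time to complete tau seconds of work followed by a
    checkpoint of cost C, re-executing the whole segment after each failure."""
    lam = 1.0 / M
    return (math.exp(lam * (tau + C)) - 1.0) * (M + R)


def efficiency(tau: float, C: float, M: float, R: float) -> float:
    """Useful-work fraction, used here as a hypothetical proxy for availability."""
    return tau / expected_segment_time(tau, C, M, R)


def young_interval(C: float, M: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * C * M)


def optimal_interval(C: float, M: float, R: float, rel_tol: float = 1e-6) -> float:
    """Golden-section search for the tau that maximizes efficiency.
    In this model efficiency is unimodal in tau and its maximum satisfies
    tau < M, so searching over (0, M] is sufficient."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, M
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    fa, fb = efficiency(a, C, M, R), efficiency(b, C, M, R)
    while hi - lo > rel_tol * M:
        if fa < fb:  # maximum lies in [a, hi]
            lo, a, fa = a, b, fb
            b = lo + inv_phi * (hi - lo)
            fb = efficiency(b, C, M, R)
        else:  # maximum lies in [lo, b]
            hi, b, fb = b, a, fa
            a = hi - inv_phi * (hi - lo)
            fa = efficiency(a, C, M, R)
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical parameters: 60 s checkpoint, 1-day MTBF, 120 s restart.
    C, M, R = 60.0, 86400.0, 120.0
    tau_y = young_interval(C, M)
    tau_opt = optimal_interval(C, M, R)
    print(f"Young's estimate: {tau_y:.0f} s, numerical optimum: {tau_opt:.0f} s")
    print(f"efficiency at optimum: {efficiency(tau_opt, C, M, R):.4f}")
```

In this simplified model the restart time R only scales the expected completion time, so it lowers the achievable efficiency without moving the optimal interval; when C is much smaller than M, the numerical optimum should land close to Young's estimate.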
Year | DOI | Venue |
---|---|---|
2018 | 10.1109/HiPC.2018.00029 | HiPC |
Keywords | Field | DocType |
---|---|---|
Checkpointing, Mathematical model, Computational modeling, Maintenance engineering, Markov processes, Redundancy | Markov process, Computer science, Redundancy (engineering), Mathematical model, Maintenance engineering, Distributed computing | Conference |
ISSN | ISBN | Citations |
---|---|---|
1094-7256 | 978-1-5386-8386-6 | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Omer Subasi | 1 | 41 | 6.34 |
Ramakrishna Tipireddy | 2 | 11 | 3.07 |
Sriram Krishnamoorthy | 3 | 1202 | 86.68 |