Title
Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability
Abstract
Checkpointing is the most widely used technique in high-performance computing (HPC) to ensure application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems to quantify their reliability and availability. We perform a trade-off analysis with respect to resource costs and reliability. Then, we explore optimal checkpoint placement to maximize system availability. Finally, we rigorously compare the behavior of redundant systems that employ replication and repair mechanisms. We postulate that the proposed models can aid system designers, who can instantiate them to assess and quantify the availability and reliability of systems of interest.
Year: 2018
DOI: 10.1109/HiPC.2018.00029
Venue: HiPC
Keywords: Checkpointing, Mathematical model, Computational modeling, Maintenance engineering, Markov processes, Redundancy
Field: Markov process, Computer science, Redundancy (engineering), Mathematical model, Maintenance engineering, Distributed computing
DocType: Conference
ISSN: 1094-7256
ISBN: 978-1-5386-8386-6
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name
Order
Citations
PageRank
Omer Subasi1416.34
Ramakrishna Tipireddy2113.07
Sriram Krishnamoorthy3120286.68