Title
When is the Right Time to Start the Fault Tolerance Protection?
Abstract
In High Performance Computing, Fault Tolerance (FT) has become a primary concern because the constant growth and continuous aging of hardware components raise the probability of failures. Failures degrade the performance of the environment and significantly affect users' expected execution time. Rollback-Recovery protocols are a fundamental component for protecting and restoring the execution of users' parallel applications, although this protection comes with an overhead. This paper proposes a First Protection Point model, which determines the point at which introducing FT protection starts to yield benefits in terms of total execution time, including failures. Rollback-Recovery protocols applied to parallel applications are characterized to obtain the key factors for the model design. The model can help users determine which checkpoints can be removed from the application execution when they are used for FT protection purposes, reducing the overhead while keeping high availability. An analytic evaluation of the model shows the inflexion point where FT protection starts to provide benefits for users. Finally, three experimental environments are set up, using two private clusters and a public cluster configured in the well-known Amazon EC2 cloud. A coordinated checkpoint facility is applied to NAS benchmark applications such as CG, BT and LU to evaluate the proposed model, reducing the overhead impact of the provided Fault Tolerance.
Year: 2017
DOI: 10.1109/HPCS.2017.70
Venue: 2017 International Conference on High Performance Computing & Simulation (HPCS)
Keywords: High Performance Computing (HPC), Fault Tolerance, Checkpoints, Protection Models
Field: Supercomputer, Computer science, Fault tolerance, Execution time, Analytic model, High availability, Reliability engineering, Cloud computing, Embedded system
DocType: Conference
ISBN: 978-1-5386-3251-2
Citations: 0
PageRank: 0.34
References: 11
Authors: 3
Name               Order  Citations  PageRank
Jorge Villamayor   1      0          1.35
Dolores Rexachs    2      195        43.20
Emilio Luque       3      1097       176.18