Title
A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery
Abstract
The number of components in High Performance Computing systems is continuously increasing to deliver more performance and satisfy the demands of scientific application users. To reduce the Mean Time To Repair in these systems and increase availability, Fault Tolerance (FT) solutions are required. Checkpoint/restart is a widely used mechanism in FT solutions. One of the most used techniques for taking checkpoints in parallel applications implemented with the Message Passing Interface is coordinated checkpointing. This paper presents a Fault Tolerance Manager (FTM) for coordinated checkpoint files that gives users automatic recovery from failures when computing nodes are lost. The proposal makes FT configuration simpler and transparent for users who have no knowledge of their application's implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. FTM takes advantage of node-local storage to save checkpoints and distributes copies of them across all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. The work is based on RADIC, a well-known architecture that provides fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results show the benefits of the presented approach in a private cluster and in a well-known cloud computing environment, Amazon EC2.
Year: 2017
DOI: 10.1109/HPCS.2017.73
Venue: 2017 International Conference on High Performance Computing & Simulation (HPCS)
Keywords: Fault Tolerance, Checkpoint/Restart, Distributed Checkpointing, Automatic Recovery
Field: Bottleneck, Computer science, Mean time to repair, Fault tolerance, Message Passing Interface, High availability, Cloud computing, Scalability, Stable storage, Distributed computing
DocType: Conference
ISBN: 978-1-5386-3251-2
Citations: 0
PageRank: 0.34
References: 10
Authors: 3
Name: Jorge Villamayor; Order: 1; Citations: 0; PageRank: 1.35
Name: Dolores Rexachs; Order: 2; Citations: 195; PageRank: 43.20
Name: Emilio Luque; Order: 3; Citations: 1097; PageRank: 176.18