Title
Epidemic Fault Tolerance for Extreme-Scale Parallel Computing.
Abstract
The process failure rate in the next generation of high-performance computing systems is expected to be very high. The MPI Forum is working on providing semantics and support for fault tolerance; the Run-Through Stabilization, User-Level Failure Mitigation, and Process Recovery proposals are the results of this effort. The Run-Through Stabilization and User-Level Failure Mitigation proposals require a fault-tolerant failure detection and consensus algorithm to inform the application of failures, so that it can employ Algorithm-Based Fault Tolerance for quicker recovery and continued execution. This paper briefly discusses the proposals, reviews the failure detectors available in the literature, and explains why they are unsuitable for realizing fault tolerance in MPI. It then outlines an inherently fault-tolerant and scalable epidemic (gossip-based) approach for failure detection and consensus. Simulations and an initial experimental analysis are presented, which indicate that this is a promising research direction.
Year: 2015
DOI: 10.1007/978-3-319-23237-9_18
Venue: IDCS
Keywords: Fault tolerance, Message Passing Interface (MPI), Failure detection, Epidemic protocols, Gossip-based protocols
DocType: Conference
Volume: 9258
ISSN: 0302-9743
Citations: 0
PageRank: 0.34
References: 10
Authors: 2
Name | Order | Citations | PageRank
Amogh Katti | 1 | 12 | 1.88
Giuseppe Di Fatta | 2 | 529 | 39.23