Abstract |
---|
The process failure rate in the next generation of high-performance computing systems is expected to be very high. The MPI Forum is working on providing semantics and support for fault tolerance; the Run-Through Stabilization, User-Level Failure Mitigation, and Process Recovery proposals are the results of this effort. The Run-Through Stabilization and User-Level Failure Mitigation proposals require a fault-tolerant failure detection and consensus algorithm to inform the application of failures, so that it can employ Algorithm-Based Fault Tolerance for quicker recovery and continued execution. This paper briefly discusses the proposals, surveys the failure detectors available in the literature, and explains why they are unsuitable for realizing fault tolerance in MPI. It then outlines an inherently fault-tolerant and scalable epidemic (gossip-based) approach to failure detection and consensus. Simulations and an initial experimental analysis are presented, which indicate that this is a promising research direction. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1007/978-3-319-23237-9_18 | IDCS |
Keywords | DocType | Volume |
Fault tolerance, Message Passing Interface (MPI), Failure detection, Epidemic protocols, Gossip-based protocols | Conference | 9258 |
ISSN | Citations | PageRank |
0302-9743 | 0 | 0.34 |
References | Authors |
10 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Amogh Katti | 1 | 12 | 1.88 |
Giuseppe Di Fatta | 2 | 529 | 39.23 |