Scalable and Fault Tolerant Failure Detection and Consensus - Citegraph

Paper Info

Title
Scalable and Fault Tolerant Failure Detection and Consensus

Abstract
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, that brings the surviving processes to agreement on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
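To illustrate why Gossip-based dissemination can scale logarithmically with system size, here is a minimal sketch of plain push gossip. It is not either of the paper's two algorithms, which the abstract does not describe in detail. The function name `gossip_cycles`, the assumption of a single failed process first noticed by a single detector, and the use of a global observer to decide when everyone agrees are all illustrative assumptions; a real algorithm must detect consensus locally.

```python
import random


def gossip_cycles(n, failed, seed=0):
    """Count push-gossip cycles until every live process lists `failed`.

    Hypothetical illustration only: one process has failed, and one live
    process (an assumed detector) knows about it at the start.
    """
    rng = random.Random(seed)
    alive = [p for p in range(n) if p != failed]

    # Each live process keeps its own list of processes it believes failed.
    suspects = {p: set() for p in alive}
    suspects[alive[0]].add(failed)  # assumption: a single initial detector

    cycles = 0
    # A global observer checks agreement; this is a simulation shortcut.
    while any(failed not in s for s in suspects.values()):
        cycles += 1
        # Snapshot the lists so that information received during this
        # cycle is only forwarded from the next cycle onwards.
        snapshot = {p: set(s) for p, s in suspects.items()}
        for p in alive:
            target = rng.choice(alive)        # uniformly random live peer
            suspects[target] |= snapshot[p]   # merge sender's list into peer
    return cycles


if __name__ == "__main__":
    for size in (64, 256, 1024):
        print(size, gossip_cycles(size, failed=3))
```

Because the number of informed processes can at most double in each cycle, the cycle count cannot be smaller than about log2 of the number of live processes; running the sketch for growing sizes shows how slowly the count grows as the system gets larger.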

Year	2015
DOI	10.1145/2802658.2802660
Venue	EuroMPI
Field	Synchronization, Computer science, Gossip, Fault tolerance, Bandwidth (signal processing), Gossip protocol, Computing systems, Scalability, Distributed computing, Chandra–Toueg consensus algorithm
DocType	Conference
Citations	8
PageRank	0.47
References	18
Authors	4

Authors (4 rows)


Name	Order	Citations	PageRank
Amogh Katti	1	12	1.88
Giuseppe Di Fatta	2	529	39.23
Thomas Naughton	3	10	1.53
Christian Engelmann	4	8	0.47
