Title
A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes
Abstract
Massively parallel computing systems are being built with thousands of nodes. Because of the high number of components, it is critical to keep these systems running even in the presence of failures. Interconnection networks play a key role in these systems, and this paper proposes a fault-tolerant routing methodology for use in such networks. The methodology supports any minimal routing function (including fully adaptive routing), does not degrade performance in the absence of faults, does not disable any healthy node, and is easy to implement in both meshes and tori. To avoid network failures, the methodology uses a simple mechanism: for some source-destination pairs, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network). The methodology is shown to tolerate a large number of faults (e.g., five/nine faults when using two/three intermediate nodes in a 3D torus). Furthermore, the methodology offers graceful performance degradation: in an 8 x 8 x 8 torus network with 14 faults, throughput decreases by only 6.49%.
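The intermediate-node mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a mesh without wraparound links, treats only node faults, and conservatively considers a segment safe only when the minimal bounding box between its endpoints contains no faulty node (since a fully adaptive minimal routing function may visit any node in that box). The names `bounding_box`, `segment_is_safe`, and `pick_intermediate` are hypothetical, and only a single intermediate node is searched for.

```python
from itertools import product


def bounding_box(a, b):
    """Nodes that some minimal path from a to b may visit in a mesh."""
    ranges = [range(min(x, y), max(x, y) + 1) for x, y in zip(a, b)]
    return set(product(*ranges))


def segment_is_safe(a, b, faults):
    """Assumption: a segment is safe only if its whole box is fault-free."""
    return bounding_box(a, b).isdisjoint(faults)


def hops(a, b):
    """Minimal hop count between two mesh nodes (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pick_intermediate(src, dst, faults, dims):
    """Return None if src -> dst is safe without help; otherwise return
    one healthy intermediate node that makes both segments safe and
    gives the shortest total path. Raise ValueError if no single
    intermediate node is sufficient."""
    if segment_is_safe(src, dst, faults):
        return None
    best = None
    for node in product(*(range(n) for n in dims)):
        if node in faults or node == src or node == dst:
            continue
        if segment_is_safe(src, node, faults) and segment_is_safe(node, dst, faults):
            cost = hops(src, node) + hops(node, dst)
            if best is None or cost < best[0]:
                best = (cost, node)
    if best is None:
        raise ValueError("no single intermediate node avoids the faults")
    return best[1]


# Illustrative 4 x 4 mesh with one faulty node at (1, 1).
via = pick_intermediate((0, 0), (3, 3), {(1, 1)}, (4, 4))
```

In this sketch, a fault placed directly between two nodes on the same row (for example source `(0, 0)`, destination `(3, 0)`, fault `(1, 0)`) leaves no valid single intermediate node under the bounding-box assumption, which is in keeping with the abstract's use of two or three intermediate nodes to tolerate more faults.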
Year
2004
DOI
10.1007/978-3-540-30141-7_49
Venue
LECTURE NOTES IN COMPUTER SCIENCE
Keywords
fault-tolerance, direct networks, adaptive routing, virtual channels, bubble flow control
Field
Polygon mesh, Grid network, Massively parallel, Computer science, Parallel algorithm, Network packet, Computer network, Fault tolerance, Throughput, Interconnection, Distributed computing
DocType
Conference
Volume
3222
ISSN
0302-9743
Citations
7
PageRank
0.49
References
14
Authors
8
Name, Order, Citations, PageRank
Nils Agne Nordbotten, 1, 90, 5.78
María Engracia Gómez, 2, 149, 17.48
Jose Flich, 3, 68, 4.49
Pedro López, 4, 233, 16.39
Antonio Robles, 5, 481, 30.40
Tor Skeie, 6, 1103, 74.67
Olav Lysne, 7, 797, 54.53
José Duato, 8, 3481, 294.85