Abstract | ||
---|---|---|
Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance. |
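
The intermediate-node idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the selection rule, so the code assumes a 2D mesh, node faults only, and the criterion that a subpath is safe for fully adaptive minimal routing only if no faulty node lies on any minimal path between its endpoints. The names `minimal_path_nodes`, `adaptive_safe`, and `pick_intermediate` are hypothetical.

```python
from itertools import product


def minimal_path_nodes(a, b):
    """Nodes lying on at least one minimal path between a and b in a 2D mesh.

    In a mesh this is exactly the bounding rectangle spanned by a and b.
    """
    (ax, ay), (bx, by) = a, b
    xs = range(min(ax, bx), max(ax, bx) + 1)
    ys = range(min(ay, by), max(ay, by) + 1)
    return set(product(xs, ys))


def adaptive_safe(a, b, faulty):
    """Assumed criterion: no minimal path from a to b touches a faulty node."""
    return minimal_path_nodes(a, b).isdisjoint(faulty)


def distance(a, b):
    """Hop count between two mesh nodes (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pick_intermediate(src, dst, faulty, width, height):
    """Return None if direct adaptive routing is safe, else an intermediate node.

    The chosen node minimizes the total hop count src -> node -> dst; ties are
    broken by coordinate order. Raises LookupError when no single intermediate
    node works, which is the case the abstract's extensions address.
    """
    if adaptive_safe(src, dst, faulty):
        return None
    candidates = [
        n
        for n in product(range(width), range(height))
        if n != src
        and n != dst
        and adaptive_safe(src, n, faulty)
        and adaptive_safe(n, dst, faulty)
    ]
    if not candidates:
        raise LookupError("no single intermediate node avoids the faults")
    return min(candidates, key=lambda n: (distance(src, n) + distance(n, dst), n))


# 4x4 mesh with one faulty node at (1, 1); corner-to-corner traffic needs a detour.
print(pick_intermediate((0, 0), (3, 3), {(1, 1)}, 4, 4))  # prints (0, 2)
```

In this toy setting the detour through `(0, 2)` keeps the total path at six hops, the same as a minimal path, while a one-row mesh with a fault between source and destination raises `LookupError`, matching the abstract's remark that a single intermediate node is not always enough.
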
Year | DOI | Venue |
---|---|---|
2006 | 10.1109/TC.2006.46 | IEEE Trans. Computers |
Keywords | Field | DocType |
fault tolerant computing,multiprocessor interconnection networks,network routing,parallel processing,adaptive routing,checkpoint-restart mechanism,direct networks,fault-tolerant routing methodology,interconnection network,parallel computing system,Fault tolerance,bubble flow control,virtual channels | Multipath routing,Dynamic Source Routing,Computer science,Massively parallel,Static routing,Computer network,Real-time computing,Geographic routing,Fault model,Distributed computing,Network packet,Parallel computing,Fault tolerance | Journal |
Volume | Issue | ISSN |
55 | 4 | 0018-9340 |
Citations | PageRank | References |
56 | 1.74 | 44 |
Authors | ||
8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Maria Engracia Gomez | 1 | 69 | 3.10 |
Nils Agne Nordbotten | 2 | 90 | 5.78 |
Jose Flich | 3 | 125 | 8.18 |
Pedro Lopez | 4 | 387 | 27.39 |
Antonio Robles | 5 | 481 | 30.40 |
Jose Duato | 6 | 893 | 54.65 |
Tor Skeie | 7 | 1103 | 74.67 |
Olav Lysne | 8 | 797 | 54.53 |