Title
Building Single Fault Survivable Parallel Algorithms for Matrix Operations Using Redundant Parallel Computation
Abstract
As the size of today's high performance computers continues to grow, node failures in these computers are becoming frequent events. Although checkpointing is the typical technique for tolerating such failures, it often introduces considerable overhead and has shown poor scalability on today's large scale systems. In this paper we define a new term, fault tolerant parallel algorithm, meaning an algorithm that gets the correct answer despite the failure of nodes. The fault tolerance approach, in which the data of failed processes is recovered by modifying applications to recompute on all surviving processes, is checkpoint-free. In particular, if no failure occurs, the fault tolerant parallel algorithms are the same as the original algorithms. We show the practicality of this technique by applying it to parallel dense matrix-matrix multiplication and Gaussian elimination to tolerate a single process failure. Experimental results demonstrate that a process failure can be tolerated with good scalability for the two fault tolerant parallel algorithms, and that the proposed fault tolerant parallel dense matrix-matrix multiplication is able to survive a process failure with very low performance overhead. The main drawback of this approach is that it is non-transparent and algorithm-dependent.
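The abstract does not spell out the recovery procedure, so the following is only an illustrative single-machine simulation of the general idea, not the authors' algorithm. It assumes a row-block distribution of A, and it assumes the inputs A and B remain available to the surviving processes (for example through replication). The names `matmul_rows`, `split`, and `fault_tolerant_matmul` are hypothetical. When one simulated process fails, its block of the result is lost and the survivors share the work of recomputing it; with no failure, the code path is the ordinary block algorithm.

```python
def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def split(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def fault_tolerant_matmul(a, b, nprocs, failed=None):
    """Simulate C = A * B over `nprocs` processes; `failed` is the rank
    of a single process whose results are lost (assumption: at most one)."""
    bounds = split(len(a), nprocs)

    # Normal phase: each simulated process computes its own rows of C.
    blocks = {rank: matmul_rows(a[lo:hi], b)
              for rank, (lo, hi) in enumerate(bounds)}

    if failed is not None:
        # The failed process's block of C is lost.
        del blocks[failed]
        lo, hi = bounds[failed]
        survivors = [r for r in range(nprocs) if r != failed]

        # Recovery phase: the lost rows are divided among the survivors,
        # each of which recomputes its share (done sequentially here).
        recovered = []
        for start, stop in split(hi - lo, len(survivors)):
            recovered.extend(matmul_rows(a[lo + start:lo + stop], b))
        blocks[failed] = recovered

    # Assemble C in rank order.
    c = []
    for rank in range(nprocs):
        c.extend(blocks[rank])
    return c
```

Calling `fault_tolerant_matmul(a, b, 3, failed=1)` should return the same matrix as the failure-free call `fault_tolerant_matmul(a, b, 3)`; only the amount of work done by the survivors differs.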
Year
2007
DOI
10.1109/CIT.2007.27
Venue
CIT
Keywords
fault tolerant parallel algorithm, single process failure, dense matrix-matrix multiplication, redundant parallel computation, gaussian elimination, parallel dense matrix-matrix multiplication, failed process, matrix operations, fault tolerant computing, matrix algebra, fault survivable parallel algorithms, parallel algorithms, gaussian processes, node failure, fault tolerance approach, process failure, building single fault survivable, proposed fault tolerant, parallel computer, parallel algorithm, matrix multiplication, fault tolerant
Field
Computer science, Parallel algorithm, Parallel computing, Process failure, Multiplication, Fault tolerance, Gaussian process, Gaussian elimination, Matrix multiplication, Distributed computing, Scalability
DocType
Conference
ISBN
978-0-7695-2983-7
Citations
3
PageRank
0.40
References
6
Authors
6
Name | Order | Citations | PageRank
Yunfei Du | 1 | 72 | 14.62
Panfeng Wang | 2 | 34 | 6.12
Hongyi Fu | 3 | 68 | 12.50
Jia Jia | 4 | 36 | 4.01
Haifang Zhou | 5 | 35 | 9.33
Xuejun Yang | 6 | 678 | 73.26