Title
The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery
Abstract
This paper addresses fault tolerance in parallel computing and proposes a new method named parallel recomputing. The method achieves fault recovery automatically by using the surviving processes to recompute the workload of failed processes in parallel. The paper first defines the fault tolerant parallel algorithm (FTPA) as a parallel algorithm that tolerates failures by parallel recomputing. It then proposes an inter-process definition-use relationship analysis method, based on conventional definition-use analysis, that reveals the relationships among variables in different processes. Guided by this method, principles of fault tolerant parallel algorithm design are given. Finally, the authors present the design of FTPAs for matrix-matrix multiplication and NPB kernels, and evaluate them experimentally on a cluster system. The experimental results show that the overhead of FTPA is less than the overhead of checkpointing.
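The idea of parallel recomputing described in the abstract can be illustrated with a minimal sketch for row-partitioned matrix-matrix multiplication. This is not the authors' implementation: processes are simulated by ordinary function calls, failure detection and the inter-process definition-use analysis are omitted, and every name (`matmul_rows`, `split_rows`, `recompute_failed_block`) is hypothetical. The sketch only shows how the rows owned by one failed process might be divided among the survivors, so that each survivor recomputes a smaller slice.

```python
# Illustrative sketch only. "Processes" are simulated sequentially, and all
# function names are hypothetical, not taken from the paper.

def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]

def split_rows(start, stop, parts):
    """Split the half-open row range [start, stop) into near-equal ranges."""
    base, extra = divmod(stop - start, parts)
    ranges, lo = [], start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

def recompute_failed_block(a, b, owner_ranges, failed):
    """Divide the failed process's rows among survivors and recompute them.

    Returns {survivor_rank: (first_row_index, recomputed_rows_of_C)}.
    """
    survivors = [p for p in range(len(owner_ranges)) if p != failed]
    start, stop = owner_ranges[failed]
    shares = split_rows(start, stop, len(survivors))
    recovered = {}
    for rank, (lo, hi) in zip(survivors, shares):
        # In a real parallel run, each survivor would do this concurrently.
        recovered[rank] = (lo, matmul_rows(a[lo:hi], b))
    return recovered

# Example: 6 rows of A spread over 3 simulated processes; process 1 fails.
a = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
b = [[1, 0], [0, 1]]
owner_ranges = split_rows(0, len(a), 3)
recovered = recompute_failed_block(a, b, owner_ranges, failed=1)
```

In this example the failed process owned rows 2 and 3, so each of the two survivors recomputes one row. With more survivors, each slice shrinks accordingly, which is the intuition behind recovering in parallel rather than restarting the whole block on a single process.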
Year
2007
DOI
10.1109/PACT.2007.73
Venue
PACT
Keywords
conventional definition-use analysis, parallel algorithm, parallel recomputing, inter-process definition-use relationship analysis, fault tolerant, fault recovery, paper firstly, parallel computing, new method, failure recovery, fault tolerance, fault tolerant parallel algorithm, parallel computer, parallel algorithms, matrix multiplication
Field
Relationship analysis, Workload, Computer science, Parallel algorithm, Parallel algorithm design, Parallel computing, Real-time computing, Multiplication, Fault tolerance, Bulk synchronous parallel, Cost efficiency
DocType
Conference
ISSN
1089-795X
ISBN
0-7695-2944-5
Citations
10
PageRank
0.84
References
14
Authors
7
Name            Order  Citations  PageRank
Xuejun Yang     1      678        73.26
Yunfei Du       2      72         14.62
Panfeng Wang    3      34         6.12
Hongyi Fu       4      68         12.50
Jia Jia         5      36         4.01
Zhiyuan Wang    6      57         6.37
Suo Guang       7      60         5.32