Title
A novel fault-tolerant parallel algorithm
Abstract
The mean time between failures of current high-performance computer systems is much shorter than the running times of many computational applications, even though those applications are the main workload for such systems. Currently, checkpoint/restart is the most commonly used scheme for such applications to tolerate hardware failures, but its performance is limited when the number of processors becomes much larger. In this paper, we propose a novel fault-tolerant parallel algorithm, FPAPR. First, we introduce the basic idea of FPAPR. Second, we detail how to implement an FPAPR program, using two NPB kernels as examples. Third, we theoretically analyze the overhead of FPAPR and find that it decreases as the number of processors increases. Finally, experimental results on a 512-CPU cluster show that the overhead introduced by the algorithm is very small.
Year: 2007
DOI: 10.1007/978-3-540-76837-1_6
Venue: APPT
Keywords: basic idea, computational application, npb kernel, 512-cpu cluster, current high-performance computer system, fpapr program, fpapr decrease, hardware failure, novel fault-tolerant parallel algorithm, mean time between failure, parallel algorithm, fault tolerant, fault tolerance, high performance computing
Field: Supercomputer, Workload, Computer science, Parallel algorithm, Parallel computing, Fault tolerance
DocType: Conference
Volume: 4847
ISSN: 0302-9743
ISBN: 3-540-76836-X
Citations: 0
PageRank: 0.34
References: 9
Authors: 6
Name            Order   Citations   PageRank
Panfeng Wang    1       34          6.12
Yunfei Du       2       72          14.62
Hongyi Fu       3       68          12.50
Haifang Zhou    4       35          9.33
Xuejun Yang     5       678         73.26
Wenjing Yang    6       37          16.43