Title
Supporting User-directed Fault Tolerance over Standard MPI
Abstract
"User-directed" means that fault tolerance is carried out dynamically and that the fault tolerance mode is chosen by users according to application requirements. In this paper, we introduce a general scheme based on standard MPI that provides user-directed support for application-level algorithmic fault tolerance. User-directed fault tolerance serves as the connection between applications and algorithmic fault tolerance. As a case study, we incorporated our scheme into HPL, combined with a non-blocking ABFT technique. We tested the functional availability of our scheme for fault tolerance under real conditions. Our evaluation shows that, when no failure occurs, our support introduces only 2.5 percent overhead. When failures occur, the scalability of algorithmic fault tolerance is well maintained under our scheme.
Year
2012
DOI
10.1109/ICPADS.2012.100
Venue
ICPADS
Keywords
application program interfaces, fault tolerance mode, hpl, algorithmic fault tolerance, fault tolerant computing, standard mpi, fault tolerance, case study, application level algorithmic fault, application requirement, nonblocking abft technique, application level algorithmic fault tolerance, message passing, application-level, functional availability, user-directed fault tolerance mode, user-directed fault tolerance, non-blocking abft technique, general scheme
Field
General protection fault, Fault coverage, Computer science, Software fault tolerance, Real-time computing, Fault tolerance, Message passing, Distributed computing, Scalability
DocType
Conference
ISSN
1521-9097
E-ISBN
978-0-7695-4903-3
ISBN
978-0-7695-4903-3
Citations
1
PageRank
0.35
References
1
Authors
5
Name           Order  Citations  PageRank
Zhimin Wu      1      1          0.69
Rui Wang       2      1          0.35
XU Wei-Zhi     3      36         8.65
Ming-yu Chen   4      902        79.29
Erlin Yao      5      163        10.93