Title
Check-Pointing Approach for Fault Tolerance in OpenSHMEM.
Abstract
Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and a one-sided communication library like OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault tolerance scheme based on check-point and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with the 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
Year
2015
DOI
10.1007/978-3-319-26428-8_3
Venue
OpenSHMEM
DocType
Conference
Volume
9397
ISSN
0302-9743
Citations
0
PageRank
0.34
References
6
Authors
5
Name
Order
Citations
PageRank
Pengfei Hao140.76
Swaroop Pophale29310.53
Pavel Shamis310013.86
Tony Curtis4684.37
Barbara M. Chapman5904119.20