Abstract |
---|
Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and implementing a working model for a one-sided communication library like OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault tolerance scheme based on checkpoint and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with the 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1007/978-3-319-26428-8_3 | OpenSHMEM |
DocType | Volume | ISSN |
Conference | 9397 | 0302-9743 |
Citations | PageRank | References |
0 | 0.34 | 6 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Pengfei Hao | 1 | 4 | 0.76 |
Swaroop Pophale | 2 | 93 | 10.53 |
Pavel Shamis | 3 | 100 | 13.86 |
Tony Curtis | 4 | 68 | 4.37 |
Barbara M. Chapman | 5 | 904 | 119.20 |