Title
Flexible Error Recovery Using Versions in Global View Resilience
Abstract
We present the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We briefly describe GVR's interfaces for distributed arrays, versioning, and cross-layer error recovery. We illustrate how GVR can be used for rollback recovery and a wide range of additional error recovery techniques, including forward recovery for latent errors or silent data corruptions. Application results demonstrate that GVR's interfaces and implementation are portable, flexible (supporting a variety of recovery models), and efficient, and that they create a gentle-slope path to tolerating growing error rates in future systems.
Year: 2015
DOI: 10.1109/CLUSTER.2015.88
Venue: Cluster Computing
Keywords: Resilience, Fault tolerance, Exascale, Scalable computing, Application-based fault tolerance
Field: Psychological resilience, Forward error correction, Monte Carlo method, Computer science, Parallel computing, Real-time computing, Fault tolerance, Rollback recovery, Distributed computing, Software versioning, Scalable computing
DocType: Conference
ISSN: 1552-5244
Citations: 1
PageRank: 0.38
References: 2
Authors: 9
Name | Order | Citations | PageRank
Nan Dun | 1 | 41 | 5.93
Hajime Fujita | 2 | 36 | 5.29
Aiman Fang | 3 | 7 | 0.81
Yan Liu | 4 | 1 | 0.38
Andrew A. Chien | 5 | 3696 | 405.97
Pavan Balaji | 6 | 1475 | 111.48
Kamil Iskra | 7 | 642 | 46.46
Wesley Bland | 8 | 3 | 0.76
Andrew R. Siegel | 9 | 42 | 7.33