Title
Improving Application Resilience by Extending Error Correction with Contextual Information
Abstract
Extreme-scale systems are growing in scope and complexity as we approach exascale. Uncorrectable faults in such systems are also increasing, so resilience efforts that address them are of great importance. In this paper, we extend a method that augments hardware error detection and correction (EDAC) contextually, and show an application-based approach that takes detectable uncorrectable (DUE) data errors and corrects them. We applied this application-based method successfully to data errors found using common EDAC, and discuss operating system changes that will make this possible on existing systems. We show that even when there are many acceptable correction choices (as may be seen in floating point), a large percentage of DUEs are corrected, and even the miscorrected data are very close to correct. We developed two different contextual criteria for this application: local averaging and global conservation of mass. Both did well in terms of closeness, but conservation of mass outperformed averaging in terms of actual correctness. The contributions of this paper are: 1) the idea of application-specific EDAC-based contextual correction, 2) its successful demonstration on a real application, 3) the development of two different contextual criteria, and 4) a discussion of attainable changes to the OS kernel that make this possible on a real system.
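The two contextual criteria named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the one-dimensional cell layout, the candidate list, and the known total mass are all hypothetical assumptions chosen for illustration. The sketch assumes a DUE has been detected at a known index and that some earlier step has produced a short list of plausible corrected values.

```python
from typing import Sequence


def pick_by_local_average(cells: Sequence[float], idx: int,
                          candidates: Sequence[float]) -> float:
    """Choose the candidate closest to the mean of the adjacent cells.

    Assumes the corrupted cell has at least one neighbor.
    """
    neighbors = [cells[j] for j in (idx - 1, idx + 1) if 0 <= j < len(cells)]
    target = sum(neighbors) / len(neighbors)
    return min(candidates, key=lambda c: abs(c - target))


def pick_by_mass_conservation(cells: Sequence[float], idx: int,
                              candidates: Sequence[float],
                              total_mass: float) -> float:
    """Choose the candidate that best restores the known total mass."""
    rest = sum(v for j, v in enumerate(cells) if j != idx)
    return min(candidates, key=lambda c: abs(rest + c - total_mass))


# Hypothetical data: cell 2 originally held 3.0 and now holds a corrupted value.
cells = [1.0, 2.0, 999.0, 7.0, 5.0]
candidates = [3.0, 4.0, 6.0]   # assumed output of a candidate-generation step
total_mass = 18.0              # assumed to be known from before the fault

avg_choice = pick_by_local_average(cells, 2, candidates)
mass_choice = pick_by_mass_conservation(cells, 2, candidates, total_mass)
```

In this made-up example, local averaging selects a value near the original but not equal to it, while the conservation check recovers the original value. That mirrors the qualitative result stated in the abstract, though the numbers here are invented.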
Year
2018
DOI
10.1109/FTXS.2018.00006
Venue
2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Keywords
fault-tolerance, high-performance-computing, error-correction-codes
Field
Psychological resilience, Supercomputer, Floating point, Computer science, Closeness, Correctness, Error detection and correction, Fault tolerance, Computer engineering, Conservation of mass
DocType
Conference
ISBN
978-1-7281-0223-8
Citations
0
PageRank
0.34
References
0
Authors
8
Name                 Order  Citations  PageRank
Alexandra Poulos     1      0          0.34
Dylan Wallace        2      0          0.34
Robert Robey         3      6          1.95
Laura Monroe         4      129        10.08
Vanessa Job          5      0          0.34
Sean Blanchard       6      190        13.20
William M. Jones     7      86         10.16
Nathan DeBardeleben  8      490        31.71