Title
Towards a More Complete Understanding of SDC Propagation
Abstract
With the rate of errors that can silently affect an application's state or output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the domain that is affected by a detected error. This paper investigates how silent data corruption (SDC) due to soft errors propagates through three HPC applications: HPCCG, Jacobi, and CoMD. To allow for a more detailed view of error propagation, the paper tracks propagation at the instruction and application variable level. The impact of detection latency on error propagation is shown along with an application's ability to recover. Finally, the impact of compiler optimizations is explored along with the impact of local problem size on error propagation.
Year: 2017
DOI: 10.1145/3078597.3078617
Venue: HPDC
Field: Propagation of uncertainty, Silent data corruption, Latency (engineering), Computer science, Parallel computing, Optimizing compiler, Error detection and correction, Real-time computing, Computation, Distributed computing
DocType: Conference
Citations: 3
PageRank: 0.38
References: 41
Authors: 4
Name | Order | Citations | PageRank
Jon Calhoun | 1 | 47 | 4.75
M. Snir | 2 | 3984 | 520.82
Luke Olson | 3 | 235 | 21.93
William D. Gropp | 4 | 5547 | 548.31