Title
Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply
Abstract
With the rate of errors that silently affect an application's state or output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work has investigated why these schemes detect some errors while others are masked. This paper investigates how silent data corruption (SDC) propagates through a sparse matrix vector multiply (SpMV), a fundamental HPC computation kernel. We discover that analyzing the mathematics of the SpMV alone limits understanding of SDC propagation. We achieve a more complete understanding by investigating how SDC propagates in an SpMV as it is expressed in machine instructions.
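As an illustration of the distinction the abstract draws, the following is a minimal sketch and not the authors' method or experimental setup. It assumes a CSR storage layout, a hypothetical 3x3 matrix, and single bit flips injected by hand; the names `flip_bit` and `spmv_csr` are invented for this example. It contrasts a flip in a stored matrix value with a flip in a stored column index, two fault sites that exist only once the SpMV is written out as loads and multiplies over concrete arrays.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a float's IEEE 754 binary64 encoding."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def spmv_csr(row_ptr, col_ind, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_ind[k]]
    return y


# Hypothetical matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_ind = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y_clean = spmv_csr(row_ptr, col_ind, vals, x)

# Fault site 1: flip the lowest exponent bit (bit 52) of the stored value 3.0.
vals_bad = list(vals)
vals_bad[2] = flip_bit(vals_bad[2], 52)
y_val = spmv_csr(row_ptr, col_ind, vals_bad, x)

# Fault site 2: flip bit 0 of the column index for the same entry (1 becomes 0),
# so the multiply reads x[0] instead of x[1].
col_bad = list(col_ind)
col_bad[2] ^= 1
y_idx = spmv_csr(row_ptr, col_bad, vals, x)

print("clean        :", y_clean)
print("value flip   :", y_val)
print("index flip   :", y_idx)
```

In this toy case both faults stay confined to one output element, but they corrupt it in different ways: one scales a product, the other substitutes a different operand. A flip in a higher index bit could instead point outside `x`, which Python reports as an `IndexError` but which compiled code might silently read from unrelated memory.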
Year
2015
DOI
10.1109/CLUSTER.2015.101
Venue
Cluster Computing
Keywords
Silent Data Corruption, Error Propagation
Field
Kernel (linear algebra), Propagation of uncertainty, Silent data corruption, Sparse matrix vector, Iterative method, Computer science, Electric breakdown, Parallel computing, Sparse matrix, Distributed computing, Computation
DocType
Conference
ISSN
1552-5244
Citations
1
PageRank
0.35
References
3
Authors
4
Name                     Order   Citations   PageRank
Jon Calhoun              1       47          4.75
M. Snir                  2       3984        520.82
Luke Olson               3       235         21.93
María Jesús Garzarán     4       411         34.13