Title | ||
---|---|---|
Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply |
Abstract | ||
---|---|---|
With the rate of errors that silently affect an application's state/output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work has been done investigating why these schemes detect some errors while others are masked. This paper investigates how silent data corruption (SDC) propagates through a sparse matrix vector multiply (SpMV), a fundamental HPC computation kernel. We discover that analyzing the mathematics of the SpMV limits understanding of SDC propagation. We achieve a more complete understanding by investigating how SDC propagates in a SpMV as it is expressed in machine instructions. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1109/CLUSTER.2015.101 | Cluster Computing |
Keywords | Field | DocType |
Silent Data Corruption, Error Propagation | Kernel (linear algebra),Propagation of uncertainty,Silent data corruption,Sparse matrix vector,Iterative method,Computer science,Electric breakdown,Parallel computing,Sparse matrix,Distributed computing,Computation | Conference |
ISSN | Citations | PageRank |
1552-5244 | 1 | 0.35 |
References | Authors | |
3 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jon Calhoun | 1 | 47 | 4.75 |
M. Snir | 2 | 3984 | 520.82 |
Luke Olson | 3 | 235 | 21.93 |
María Jesús Garzarán | 4 | 411 | 34.13 |