Abstract |
---|
The effectiveness of sparse linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruption through bit-flips. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining high convergence rates in this environment. We also introduce a performance model and present numerical results in support of the methodology. |
Year | Venue | Field |
---|---|---|
2015 | SpringSim (HPS) | Convergence (routing),Silent data corruption,Computer science,Parallel computing,Fault tolerance,Performance model,Solver,Multigrid method,Distributed computing,Computational complexity theory |
DocType | Citations | PageRank |
Conference | 2 | 0.38 |
References | Authors |
13 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jon Calhoun | 1 | 47 | 4.75 |
Luke Olson | 2 | 235 | 21.93 |
M. Snir | 3 | 3984 | 520.82 |
William D. Gropp | 4 | 5547 | 548.31 |