Title
Towards a more fault resilient multigrid solver.
Abstract
The effectiveness of sparse linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruptions through bit-flips. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining high convergence rates in this environment. We also introduce a performance model and present numerical results in support of the methodology.
Year
2015
Venue
SpringSim (HPS)
Field
Convergence (routing), Silent data corruption, Computer science, Parallel computing, Fault tolerance, Performance model, Solver, Multigrid method, Distributed computing, Computational complexity theory
DocType
Conference
Citations
2
PageRank
0.38
References
13
Authors
4
Name              Order  Citations  PageRank
Jon Calhoun       1      47         4.75
Luke Olson        2      235        21.93
M. Snir           3      3984       520.82
William D. Gropp  4      5547       548.31