Title
Resilience for Massively Parallel Multigrid Solvers.
Abstract
Fault tolerant massively parallel multigrid methods for elliptic partial differential equations are a step towards resilient solvers. Here, we combine domain partitioning with geometric multigrid methods to obtain fast and fault-robust solvers for three-dimensional problems. The recovery strategy is based on the redundant storage of ghost values, as they are commonly used in distributed memory parallel programs. In the case of a fault, the redundant interface values can be easily recovered, while the lost inner unknowns are recomputed approximately with recovery algorithms using multigrid cycles for solving a local Dirichlet problem. Different strategies are compared and evaluated with respect to performance, computational cost, and speedup. Especially effective are asynchronous strategies combining global solves with accelerated local recovery. By this, multiple faults can be fully compensated with respect to both the number of iterations and run-time. For illustration, we use a state-of-the-art petascale supercomputer to study failure scenarios when solving systems with up to $6 \cdot 10^{11}$ (0.6 trillion) unknowns.
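The recovery idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-dimensional Poisson model problem instead of a three-dimensional one, runs serially, and uses plain Gauss-Seidel sweeps where the paper uses multigrid cycles. All names (`gauss_seidel`, `max_error`, the index ranges) are hypothetical and chosen only for illustration.

```python
import math


def gauss_seidel(u, f, h, lo, hi, sweeps):
    """Gauss-Seidel sweeps for -u'' = f on the nodes lo..hi-1.

    The values u[lo-1] and u[hi] are never modified, so they act as
    Dirichlet boundary data for the (sub)domain.
    """
    for _ in range(sweeps):
        for i in range(lo, hi):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


def max_error(u, exact):
    """Maximum nodal difference between two grid functions."""
    return max(abs(a - b) for a, b in zip(u, exact))


n = 32                                   # grid intervals on [0, 1]
h = 1.0 / n
x = [i * h for i in range(n + 1)]
f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]
exact = [math.sin(math.pi * xi) for xi in x]

# Global solve with homogeneous Dirichlet conditions at x = 0 and x = 1.
u = [0.0] * (n + 1)
gauss_seidel(u, f, h, 1, n, sweeps=4000)
err_before = max_error(u, exact)

# Simulated fault (assumption): one subdomain loses its inner unknowns at
# nodes 9..23. The interface nodes 8 and 24 are assumed to survive as
# ghost copies held by the neighbouring subdomains.
lo, hi = 9, 24
for i in range(lo, hi):
    u[i] = 0.0
err_fault = max_error(u, exact)

# Recovery: solve the local Dirichlet problem on the faulty subdomain only,
# using the surviving interface values as boundary data.
gauss_seidel(u, f, h, lo, hi, sweeps=1000)
err_after = max_error(u, exact)

print(f"error before fault:   {err_before:.2e}")
print(f"error after fault:    {err_fault:.2e}")
print(f"error after recovery: {err_after:.2e}")
```

In this sketch the local solve touches only the lost unknowns, and because the interface values are already converged, the recovered solution matches the pre-fault one up to the iteration tolerance. The asynchronous strategies mentioned in the abstract, which overlap global solves with accelerated local recovery, are not modelled here.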
Year
2016
DOI
10.1137/15M1026122
Venue
SIAM Journal on Scientific Computing
Keywords
fault tolerant algorithms, massively parallel and asynchronous multigrid
Field
Asynchronous communication, Dirichlet problem, Massively parallel, Computer science, Parallel computing, Distributed memory, Fault tolerance, Elliptic partial differential equation, Multigrid method, Speedup
DocType
Journal
Volume
38
Issue
5
ISSN
1064-8275
Citations
6
PageRank
0.41
References
31
Authors
4
Name, Order, Citations, PageRank
Markus Huber, 1, 21, 3.12
Björn Gmeiner, 2, 78, 6.24
Ulrich Rüde, 3, 38, 3.97
Barbara I. Wohlmuth, 4, 320, 50.97