Title
Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults
Abstract
Large-scale systems of linear equations are pervasive in many scientific fields, and considerable effort has gone over the last decade into efficient techniques for solving them, often relying on High Performance Computing (HPC) infrastructures to boost performance. In this context, the ever-growing scale of supercomputers inevitably increases the frequency of faults, making faults a crucial issue in HPC application development. A previous study [1] investigated the possibility of enhancing the Inhibition Method (IMe), a linear systems solver for dense unstructured matrices, with fault tolerance to single hard errors, i.e. failures that cause one computing processor to stop. This article extends [1] by proposing an efficient technique to obtain fault tolerance to multiple hard errors, which may occur concurrently on different processors belonging to the same or different machines. An improved parallel implementation is also proposed, which is particularly suitable for HPC environments and moves towards complete decentralization. The theoretical analysis suggests that the technique (which requires neither checkpointing nor rollback) can provide fault tolerance to multiple faults at the price of a small overhead and a limited number of additional processors to store the checksums. Experimental results on an HPC architecture validate the theoretical study, showing promising performance improvements with respect to a popular fault-tolerant solving technique.
Year
2020
DOI
10.1109/SRDS51746.2020.00034
Venue
2020 International Symposium on Reliable Distributed Systems (SRDS)
Keywords
Fault tolerance, multiple hard faults, High Performance Computing, linear equation systems solver, Inhibition Method
DocType
Conference
ISSN
1060-9857
ISBN
978-1-7281-7627-7
Citations
0
PageRank
0.34
References
30
Authors
3
Name             Order  Citations  PageRank
Daniela Loreti   1      28         6.55
M. Artioli       2      0          0.68
A. Ciampolini    3      3          1.74