Title
Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs.
Abstract
The cost of the iterative solution of a sparse matrix–vector system against multiple vectors is a common challenge within scientific computing. A tremendous number of algorithmic advances, such as eigenvector deflation and domain-specific multi-grid algorithms, have been ubiquitously beneficial in reducing this cost. However, they do not address the intrinsic memory-bandwidth constraints of the matrix–vector operation that dominates iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix–vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. Practical implementations typically suffer from quadratic scaling in the number of vector–vector operations. We present an implementation of the block Conjugate Gradient algorithm on NVIDIA GPUs that reduces the memory-bandwidth complexity of vector–vector operations from quadratic to linear. As a representative case, we consider the domain of lattice quantum chromodynamics and present results for one of the fermion discretizations. Using the QUDA library as a framework, we demonstrate a 5× speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
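The block Conjugate Gradient algorithm referenced in the abstract can be sketched as follows. This is a minimal NumPy illustration of the classic block CG recurrence (after O'Leary), not the paper's GPU implementation in QUDA: the scalar step sizes of ordinary CG become small m×m matrices shared across the m right-hand sides, so all solves share one Krylov space and each iteration needs only a single batched matrix–vector product.

```python
import numpy as np

def block_cg(A, B, tol=1e-10, max_iter=500):
    """Solve A X = B for m right-hand sides at once with block CG.

    A: symmetric positive-definite (n, n) matrix
    B: (n, m) block of right-hand sides
    """
    X = np.zeros_like(B)
    R = B - A @ X              # block residual
    P = R.copy()               # block of search directions
    RtR = R.T @ R              # m x m Gram matrix, reused between steps
    for _ in range(max_iter):
        AP = A @ P                              # one batched mat-vec per iteration
        alpha = np.linalg.solve(P.T @ AP, RtR)  # m x m "step size" matrix
        X += P @ alpha
        R -= AP @ alpha
        RtR_new = R.T @ R
        if np.sqrt(np.trace(RtR_new)) < tol * np.linalg.norm(B):
            break
        beta = np.linalg.solve(RtR, RtR_new)    # m x m "momentum" matrix
        P = R + P @ beta
        RtR = RtR_new
    return X
```

The vector–vector work (the Gram matrices `P.T @ AP` and `R.T @ R`) is what scales quadratically in m; the paper's contribution is reducing the memory-bandwidth cost of exactly these operations to linear on GPUs.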
Year
2018
DOI
10.1016/j.cpc.2018.06.019
Venue
Computer Physics Communications
Keywords
Block solver, GPU
Field
Conjugate gradient method, Mathematical optimization, Memory bandwidth, Lattice (order), Cache, Quadratic equation, Scaling, Mathematics, Eigenvalues and eigenvectors, Speedup
DocType
Journal
Volume
233
ISSN
0010-4655
Citations
0
PageRank
0.34
References
20
Authors
5