Title
Anatomy of High-Performance Many-Threaded Matrix Multiplication
Abstract
BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM as well as additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. The resulting implementations deliver what we believe to be the best open source performance for these architectures, achieving both impressive performance and excellent scalability.
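The loop structure the abstract describes can be illustrated with a minimal sketch: three outer loops that block for the caches, two further loops exposed inside what used to be the inner kernel, and a small micro-kernel at the centre. Everything named here is an assumption made for illustration: the function names, the block sizes `NC`, `KC`, `MC`, `NR`, `MR`, and the loop order (which follows the commonly described GotoBLAS/BLIS layering). Packing of A and B into contiguous buffers, which a real implementation relies on, is omitted, and this is not the BLIS API.

```python
# Illustrative block sizes (assumptions, not tuned values from the paper).
NC, KC, MC = 8, 4, 6   # cache-level blocking of n, k, and m
NR, MR = 4, 2          # register-level blocking used by the micro-kernel


def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Update an mr x nr block of C with kc rank-1 updates."""
    for p in range(p0, p0 + kc):
        for i in range(i0, i0 + mr):
            a = A[i][p]
            for j in range(j0, j0 + nr):
                C[i][j] += a * B[p][j]


def gemm_blocked(A, B, C):
    """Compute C += A * B with five loops around a micro-kernel."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):                  # 5th loop: column panels of C and B
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):              # 4th loop: blocks of the k dimension
            kc = min(KC, k - pc)
            for ic in range(0, m, MC):          # 3rd loop: row blocks of C and A
                mc = min(MC, m - ic)
                # The two loops below are the ones exposed inside the
                # former monolithic inner kernel.
                for jr in range(jc, jc + nc, NR):       # 2nd loop
                    nr = min(NR, jc + nc - jr)
                    for ir in range(ic, ic + mc, MR):   # 1st loop
                        mr = min(MR, ic + mc - ir)
                        micro_kernel(A, B, C, ir, jr, pc, mr, nr, kc)
    return C


def gemm_naive(A, B):
    """Reference triple-loop product, used only to check the sketch."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

In this sketch, different iterations of the `jc`, `ic`, `jr`, and `ir` loops write to disjoint blocks of `C`, so each of them is a candidate for multithreading, whereas iterations of the `pc` loop accumulate into the same entries of `C` and would need a reduction or synchronization. Which loops are worth parallelizing on a given machine is the question the paper addresses; the sketch only shows where those loops sit.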
Year
2014
DOI
10.1109/IPDPS.2014.110
Venue
IPDPS
Keywords
linear algebra, libraries, high-performance, matrix, blas, multicore, scalability, integrated circuits, kernel, instruction sets, matrix multiplication, multi threading, computer architecture, parallel processing, multithreading
Field
Kernel (linear algebra), Multithreading, Computer science, Xeon Phi, Parallel computing, Porting, Matrix multiplication, Multi-core processor, PowerPC, Scalability
DocType
Conference
ISSN
1530-2075
Citations
39
PageRank
1.30
References
12
Authors
6
Name                     Order  Citations  PageRank
Tyler M. Smith           1      90         5.37
Robert A. van de Geijn   2      2047       203.08
Mikhail Smelyanskiy      3      1160       65.96
Jeff R. Hammond          4      262        18.06
Field G. Van Zee         5      312        23.19
Van De Geijn, R.         6      46         1.83