Title
Adaptive Winograd's matrix multiplications
Abstract
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of counting the cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive Strassen-Winograd's matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to matrices of any size and shape stored in either row- or column-major layout (in double precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into sub-MMs of equivalent complexity and does not require any extra computation to combine the intermediary sub-MM results. We achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone for a single-core system and up to 19% for a two dual-core processor system. Most importantly, even for small matrices such as 1500 × 1500, our approach already attains a 10% execution-time reduction and, for MM of matrices larger than 3000 × 3000, it delivers performance that would correspond, for a classic O(n³) algorithm, to faster-than-processor-peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS on a system with a 4.4 GFLOPS peak, where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithm could ever be for matrices of this size. Furthermore, we present experimental evidence, based on established methodologies found in the literature, that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
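To illustrate the kind of recursion the abstract describes, here is a minimal sketch of the Strassen-Winograd variant (7 recursive sub-multiplications instead of the classic 8) with a hybrid base case that falls back to a tuned kernel below a cutoff. This is a generic textbook formulation, not the paper's adaptive implementation: it assumes square matrices whose size is a power of two, and NumPy's `@` stands in for the ATLAS/GotoBLAS leaf kernel; the `leaf` cutoff is an illustrative parameter.

```python
import numpy as np

def winograd_mm(A, B, leaf=64):
    """Recursive Strassen-Winograd multiply (a sketch).

    Assumes square matrices whose order is a power of two; below
    `leaf`, it falls back to the tuned kernel (here, NumPy's `@`),
    mimicking the hybrid use of a high-performance BLAS.
    """
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # stand-in for the ATLAS/GotoBLAS base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # 8 pre-additions on the operands ...
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    S5 = B12 - B11; S6 = B22 - S5; S7 = B22 - B12; S8 = S6 - B21

    # ... 7 recursive sub-multiplications (vs. 8 for the classic algorithm) ...
    M1 = winograd_mm(S2, S6, leaf)
    M2 = winograd_mm(A11, B11, leaf)
    M3 = winograd_mm(A12, B21, leaf)
    M4 = winograd_mm(S3, S7, leaf)
    M5 = winograd_mm(S1, S5, leaf)
    M6 = winograd_mm(S4, B22, leaf)
    M7 = winograd_mm(A22, S8, leaf)

    # ... and 7 post-additions to assemble the result quadrants.
    T1 = M1 + M2
    T2 = T1 + M4
    C = np.empty_like(A)
    C[:h, :h] = M2 + M3
    C[:h, h:] = T1 + M5 + M6
    C[h:, :h] = T2 - M7
    C[h:, h:] = T2 + M5
    return C
```

Each level of the recursion trades one multiplication for a constant number of additions, which is where the operation savings (and the "faster than peak" effect for a classic O(n³) accounting) come from.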
Year
2009
DOI
10.1145/1486525.1486528
Venue
ACM Trans. Math. Softw.
Keywords
Additional key words and phrases: Winograd's matrix multiplications, classic algorithm, fast algorithms, adaptive Winograd, execution-time reduction, dual-core processor system, good performance, classic MM algorithm, GFLOPS peak performance, matrix multiplication, single-core system, GFLOPS performance, classic O, peak performance, linear algebra
Field
Linear algebra, Single-core, Mathematical optimization, Matrix (mathematics), FLOPS, Computer science, Parallel computing, Double-precision floating-point format, Fortran, Theoretical computer science, Matrix multiplication, Computation
DocType
Journal
Volume
36
Issue
1
ISSN
0098-3500
Citations
12
PageRank
0.86
References
35
Authors
2
Name              Order  Citations  PageRank
Paolo D'Alberto   1      136        11.24
Alexandru Nicolau 2      2265       307.74