Cache‐oblivious matrix algorithms in the age of multicores and many cores - Citegraph

Paper Info

Title
Cache‐oblivious matrix algorithms in the age of multicores and many cores

Abstract
This article highlights the issue of upcoming wider single-instruction, multiple-data units as well as steadily increasing core counts on contemporary and future processor architectures. We present the recent port to and latest results of cache-oblivious algorithms and implementations of our TifaMMy code on four architectures: SGI's UltraViolet distributed shared-memory machine, Intel's latest x86 architecture code-named Sandy Bridge, AMD's new Bulldozer architecture, and Intel's future Many Integrated Core architecture. TifaMMy's matrix multiplication and LU decomposition routines have been adapted and tuned with regard to these architectures. Results are discussed and compared with vendors' architecture-specific and optimized libraries, Math Kernel Library and AMD Core Math Library, for both a standard C++ version with vectorization compiler switches and TifaMMy's highly optimized vector intrinsics version. We provide insights into architectural properties and comment on the feasibility of heterogeneous cores and accelerators, namely graphics processing units. Besides bare-metal performance, the test platforms' ease of use is analyzed in detail, and the portability of our approach to new and upcoming silicon is discussed with regard to required effort on code change abstraction levels. As a result, we demonstrate that because of its generic structure in terms of memory organization, TifaMMy executes with equally efficient performance on all four architectures as it automatically adapts itself to architectural parameters without losing performance against the Math Kernel Library and AMD Core Math Library, underlining its generic and cache-oblivious properties, as the porting effort was relatively low compared with that in other implementations. Copyright (c) 2012 John Wiley & Sons, Ltd.

Year	DOI	Venue
2015	10.1002/cpe.2974	CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
Keywords	Field	DocType
shared-memory platforms,cache oblivious,block recursive,linear algebra,performance,parallelization	x86,Computer science,Vectorization (mathematics),Intrinsics,LU decomposition,Distributed computing,Computer architecture,Cache-oblivious algorithm,Parallel computing,Algorithm,Compiler,Porting,Software portability	Journal
Volume	Issue	ISSN
27	SP9	1532-0626
Citations	PageRank	References
2	0.40	11
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Alexander Heinecke	1	344	32.67
Carsten Trinitis	2	151	29.80
