High-order stencil computations on multicore clusters - Citegraph

Paper Info

Title
High-order stencil computations on multicore clusters

Abstract
Stencil computation (SC) is critically important to a broad range of scientific and engineering applications. However, optimizing complex, high-order SC on emerging clusters of multicore processors remains a challenge. We have developed a hierarchical SC parallelization framework that combines: (1) spatial decomposition based on message passing; (2) multithreading using a critical-section-free dual representation; and (3) single-instruction multiple-data (SIMD) parallelism based on various code transformations. Our SIMD transformations include translocated statement fusion, vector composition via shuffle, and vectorized data layout reordering (e.g., matrix transpose), which are combined with traditional optimization techniques such as loop unrolling. We have thereby implemented two SCs with different characteristics, a diagonally dominant lattice Boltzmann method (LBM) code for fluid flow simulation and a highly off-diagonal (6th-order) finite-difference time-domain (FDTD) code for seismic wave propagation, on a Cell Broadband Engine (Cell BE) based system (a cluster of PlayStation3 consoles), a dual Intel quadcore platform, and IBM BlueGene/L and P. We have achieved high inter-node and intra-node (multithreading and SIMD) scalability for the diagonally dominant LBM: weak-scaling parallel efficiency of 0.978 on 131,072 BlueGene/P processors; strong-scaling multithreading efficiency of 0.882 on 6 cores of Cell BE; and strong-scaling SIMD efficiency of 0.780 using 4-element vector registers of Cell BE. The implementation of the high-order SC, by contrast, is less efficient due to long-stride memory access and the limited size of the vector register file, which points to the need for further optimizations.
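
To make the "high-order" and "long-stride memory access" remarks in the abstract concrete, here is a minimal sketch, not the paper's kernel. It assumes the textbook 7-point central-difference weights for a second derivative with 6th-order accuracy, and a hypothetical row-major (z, y, x) grid layout. The function names and the layout are illustrative assumptions only.

```python
# Textbook 6th-order-accurate central-difference weights for a second
# derivative, for offsets -3..3. Assumed for illustration; the paper's
# actual FDTD coefficients are not given in the abstract.
COEFFS = (1 / 90, -3 / 20, 3 / 2, -49 / 18, 3 / 2, -3 / 20, 1 / 90)
RADIUS = 3


def second_derivative_6th(u, h):
    """Apply the 7-point stencil to the interior points of a 1D list u.

    Points within RADIUS of either end are left at 0.0, since the
    stencil would reach outside the array there.
    """
    out = [0.0] * len(u)
    for i in range(RADIUS, len(u) - RADIUS):
        acc = 0.0
        for k, c in enumerate(COEFFS):
            acc += c * u[i + k - RADIUS]
        out[i] = acc / (h * h)
    return out


def flat_offsets(nx, ny):
    """Flat-array offsets the same stencil touches along each axis of a
    3D grid stored row-major as (z, y, x). Hypothetical layout."""
    strides = {"x": 1, "y": nx, "z": nx * ny}
    return {
        axis: [k * stride for k in range(-RADIUS, RADIUS + 1)]
        for axis, stride in strides.items()
    }
```

Under these assumptions, the x-direction neighbors are contiguous in memory, while the z-direction neighbors are separated by multiples of `nx * ny` elements. That is one plausible reading of why a wide, off-diagonal stencil is harder to vectorize than a nearest-neighbor one.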

Year	DOI	Venue
2009	10.1109/IPDPS.2009.5161011	IPDPS
Keywords	Field	DocType
vector register file,seismic wave propagation,high-order sc,high-order stencil computations,multicore cluster,cell broadband engine,simd transformation,high-order stencil computation,strong-scaling simd efficiency,multi-threading,lattice boltzmann method,parallel efficiency,hierarchical stencil computation parallelization,4-element vector register,hierarchical sc parallelization framework,single-instruction multiple-data parallelism,finite-difference time-domain code,vector composition,multithreading,code transformations,message passing,multicore clusters,dual intel quadcore platform,vectorized data layout reordering,multicore processors,multithreading efficiency,computer architecture,parallel processing,critical section,finite difference methods,lattice boltzmann methods,multicore processing,computational modeling,fluid flow,single instruction multiple data,register file,matrix decomposition,lattices,registers,finite difference time domain,multi threading	Multithreading,Transpose,Computer science,Parallel computing,Stencil,SIMD,Stencil code,Register file,Loop unrolling,Multi-core processor,Distributed computing	Conference
ISSN	ISBN	Citations
1530-2075 E-ISBN : 978-1-4244-3750-4	978-1-4244-3750-4	24
PageRank	References	Authors
1.07	4	10

Authors (10 rows)

Name	Order	Citations	PageRank
Liu Peng	1	71	6.17
Richard Seymour	2	62	3.78
Ken-ichi Nomura	3	132	13.36
Rajiv K. Kalia	4	239	35.66
Aiichiro Nakano	5	279	47.53
Priya Vashishta	6	243	37.69
Alexander Loddoch	7	24	1.07
Michael Netzband	8	24	1.07
William R. Volz	9	31	2.12
Chap C. Wong	10	24	1.07
