Title
Locality-Aware Mapping of Nested Parallel Patterns on GPUs
Abstract
Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.
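The search described in the abstract (parameterize a block size per nest level, prune with hard constraints, rank with soft constraints) can be illustrated with a minimal sketch. This is not the authors' algorithm: the candidate block sizes, the scoring weights, and every function name below are hypothetical, and the 1024-threads-per-block limit and 32-thread warp size are assumed properties of the target GPU.

```python
from itertools import product

# Assumed hardware parameters for the target GPU (not taken from the paper).
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32


def candidate_mappings(num_levels, block_sizes=(1, 32, 64, 128, 256, 512, 1024)):
    """Enumerate one block size per nest level, outermost level first."""
    return product(block_sizes, repeat=num_levels)


def threads_per_block(mapping):
    """Total threads in a block is the product of the per-level block sizes."""
    total = 1
    for block_size in mapping:
        total *= block_size
    return total


def satisfies_hard_constraints(mapping):
    """Hard constraint: a mapping that exceeds the block limit cannot launch."""
    return threads_per_block(mapping) <= MAX_THREADS_PER_BLOCK


def soft_score(mapping, level_sizes):
    """Hypothetical soft-constraint score; higher is better."""
    score = 0
    # Prefer an innermost block size that covers whole warps, on the
    # assumption that the innermost level touches contiguous memory.
    if mapping[-1] % WARP_SIZE == 0:
        score += 2
    # Prefer blocks with enough threads to keep the GPU occupied.
    if threads_per_block(mapping) >= 256:
        score += 1
    # Penalize block sizes larger than the level's iteration count,
    # since the extra threads would sit idle.
    for block_size, size in zip(mapping, level_sizes):
        if block_size > size:
            score -= 1
    return score


def select_mapping(level_sizes):
    """Prune with hard constraints, then keep the best-scoring mapping."""
    valid = [m for m in candidate_mappings(len(level_sizes))
             if satisfies_hard_constraints(m)]
    return max(valid, key=lambda m: soft_score(m, level_sizes))


if __name__ == "__main__":
    # Outer pattern over 10000 elements, inner pattern over 64 elements.
    print(select_mapping((10000, 64)))
```

For a two-level nest with 10000 outer and 64 inner iterations, this sketch selects a 2D block rather than a 1D one, because the inner level is too small to fill a block on its own. The analysis summarized in the abstract also parameterizes the degree of parallelism per dimension, which this sketch omits.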
Year
2014
DOI
10.1109/MICRO.2014.23
Venue
MICRO
Keywords
parallel processing, gpu-specific hard constraints, gpu-specific soft constraints, parallelism degree, shared memory, gpu hardware, 2d mapping, block size parameterization, parallel semantics encoding, 1d mapping, graphics processing units, gpu kernels, high level computation patterns, logical multidimensional domain, nested parallel pattern, higher level languages, locality-aware mapping, general analysis framework, shared memory systems, pattern mapping, compiler optimizations, optimising compilers, error correcting code, resilience, hardware, kernel, optimization, dram, faults, programming, instruction sets
Field
Kernel (linear algebra), Locality, Shared memory, Computer science, Degree of parallelism, CUDA, Instruction set, Parallel computing, Optimizing compiler, Code generation, Theoretical computer science
DocType
Conference
ISSN
1072-4451
Citations
20
PageRank
0.88
References
19
Authors
5
Name               Order  Citations  PageRank
HyoukJoong Lee     1      414        17.71
Kevin J. Brown     2      448        18.62
Arvind K. Sujeeth  3      502        20.58
Tiark Rompf        4      743        45.86
Kunle Olukotun     5      4532       373.50