Title
WarpPool: sharing requests with inter-warp coalescing for throughput processors
Abstract
Although graphics processing units (GPUs) are capable of high compute throughput, their memory systems need to supply the arithmetic pipelines with data at a sufficient rate to avoid stalls. For benchmarks that have divergent access patterns or cause the L1 cache to run out of resources, the link between the GPU's load/store unit and the L1 cache becomes a bottleneck in the memory system, leading to low utilization of compute resources. While current GPU memory systems are able to coalesce requests between threads in the same warp, we identify a form of spatial locality between threads in multiple warps. We use this locality, which is overlooked in current systems, to merge requests being sent to the L1 cache. This relieves the bottleneck between the load/store unit and the cache, and provides an opportunity to prioritize requests to minimize cache thrashing. Our implementation, WarpPool, yields a 38% speedup on memory throughput-limited kernels by increasing throughput to the L1 by 8% and reducing the number of L1 misses by 23%. We also demonstrate that WarpPool can improve GPU programmability by achieving high performance without the need to optimize workloads' memory access patterns. A Verilog implementation including place-and-route shows WarpPool requires 1.0% added GPU area and 0.8% added power.
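Illustration (not from the paper): the abstract's key observation is spatial locality between threads in different warps rather than within one warp. The following is a minimal CUDA sketch of an access pattern with that property; the kernel name, data layout, and launch parameters are illustrative assumptions, not the paper's benchmarks. Each lane of a warp touches a different 128-byte cache line, so intra-warp coalescing cannot merge the requests, while the same lane in consecutive warps touches adjacent words of one line, which is the kind of inter-warp sharing the abstract describes WarpPool exploiting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: this kernel is not from the WarpPool paper.
// With 4-byte floats, a 128-byte L1 cache line holds 32 elements.
__global__ void inter_warp_locality(const float *in, float *out, int num_warps) {
    int warp = threadIdx.x / 32;   // warp index within the block
    int lane = threadIdx.x % 32;   // lane index within the warp

    // Lanes 0..31 of one warp read addresses num_warps floats apart, so each
    // lane touches a different cache line and intra-warp coalescing cannot
    // merge them. Lane l of warp w and lane l of warp w+1 read adjacent
    // floats, i.e. the same cache line: inter-warp spatial locality.
    int idx = lane * num_warps + warp;
    out[idx] = in[idx] * 2.0f;
}

int main() {
    const int num_warps = 32;          // 32 warps -> 1024 threads in one block
    const int n = 32 * num_warps;      // one element per (lane, warp) pair
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    inter_warp_locality<<<1, 32 * num_warps>>>(in, out, num_warps);
    cudaDeviceSynchronize();

    printf("out[1] = %f\n", out[1]);   // expect 2.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```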
Year
2015
DOI
10.1145/2830772.2830830
Venue
MICRO
Keywords
GPGPU, memory coalescing, memory divergence
Field
Uniform memory access, Cache pollution, Computer science, Cache, Parallel computing, Cache-only memory architecture, Cache algorithms, Real-time computing, Page cache, Non-uniform memory access, Cache coloring, Operating system
DocType
Conference
ISBN
978-1-5090-6601-8
Citations
6
PageRank
0.44
References
21
Authors
7
Name                  Order  Citations  PageRank
John Kloosterman      1      7          0.78
Jonathan Beaumont     2      36         2.85
Mick Wollman          3      6          0.44
Ankit Sethia          4      105        4.91
Ronald G. Dreslinski  5      1258       81.02
Trevor Mudge          6      6139       659.74
Scott Mahlke          7      4811       312.08