Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs - Citegraph

Paper Info

Title
Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

Abstract
On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-the-art accelerators, such as the NVIDIA Fermi or Kepler GPUs and Intel's forthcoming MIC “Knights Landing” (KNL), support both software-managed caches, also known as shared memory (GPUs) or near memory (KNL), and hardware-managed L1 data caches (D-caches). Furthermore, shared memory and the L1 D-cache on a GPU utilize the same physical storage and their capacity can be configured at runtime (same for KNL). In this paper, we present an in-depth study to reveal interesting and sometimes unexpected tradeoffs between shared memory and the hardware-managed L1 D-caches in GPU architecture. In our study, the kernels utilizing the L1 D-caches are generated from those leveraging shared memory to ensure that the same optimizations such as tiling are applied equally in both versions. Our detailed analyses reveal that rather than cache hit rates, the following tradeoffs often have more profound performance impacts. On one hand, the kernels utilizing the L1 caches may support higher degrees of thread-level parallelism, offer more opportunities for data to be allocated in registers, and sometimes result in lower dynamic instruction counts. On the other hand, the applications utilizing shared memory enable more coalesced accesses and tend to achieve higher degrees of memory-level parallelism. Overall, our results show that most benchmarks perform significantly better with shared memory than the L1 D-caches due to the high impact of memory-level parallelism and memory coalescing.
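
For intuition about the memory-coalescing effect the abstract mentions, the following is a minimal sketch of a simplified model, not code or methodology from the paper. It counts how many memory segments one warp-wide load touches for a given access stride. The function name `transactions_per_warp`, the 32-thread warp, and the 128-byte segment size are assumptions chosen for illustration; real hardware behavior depends on the architecture and cache configuration.

```python
# Simplified coalescing model (illustrative assumption, not from the paper).
WARP_SIZE = 32        # threads issuing one load together (assumed)
SEGMENT_BYTES = 128   # assumed size of one memory transaction / cache line


def transactions_per_warp(stride_elems, elem_bytes=4, base_addr=0):
    """Count the distinct segments touched by one warp-wide load.

    Thread `lane` reads the element at index `lane * stride_elems`,
    starting from byte address `base_addr`.
    """
    segments = set()
    for lane in range(WARP_SIZE):
        addr = base_addr + lane * stride_elems * elem_bytes
        segments.add(addr // SEGMENT_BYTES)
    return len(segments)


if __name__ == "__main__":
    # Unit stride falls in one segment; larger strides spread across more.
    for stride in (1, 2, 32):
        print(f"stride={stride}: {transactions_per_warp(stride)} segment(s)")
```

Under these assumptions, a unit-stride load of 4-byte elements touches a single segment, while a stride of 32 elements touches 32 separate segments. A tiled kernel can stage data with unit-stride loads and then read the tile from on-chip storage in whatever order the computation needs.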

Year	DOI	Venue
2014	10.1109/ISPASS.2014.6844487	ISPASS
Keywords	Field	DocType
memory-level parallelism,memory coalescing,shared memory,data allocation,cache storage,dynamic instruction counts,hardware-managed l1 data caches,software-managed caches,graphics processing units,near memory,multi-threading,gpu architecture,shared memory systems,registers,cache hit rates,thread-level parallelism,accelerators,l1 d-caches,hardware-managed caches,intel mic knights landing,off-chip memory access latencies,on-chip caches,kepler gpus,nvidia fermi,computer architecture,thread level parallelism,memory level parallelism,parallel processing,kernel,multi threading	Interleaved memory,Uniform memory access,Shared memory,Computer science,Parallel computing,Distributed memory,False sharing,Bus sniffing,Computer hardware,Distributed shared memory,Operating system,Cache coherence	Conference
Citations	PageRank	References	Authors
13	0.60	8	6

Authors (6 rows)

Name	Order	Citations	PageRank
Chao Li	1	132	6.04
Yi Yang	2	279	14.78
Hongwen Dai	3	28	3.14
Shengen Yan	4	125	5.25
Frank Mueller	5	3497	219.77
Huiyang Zhou	6	994	63.26
