Title
Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels
Abstract
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) of a GPU adversely impacts its ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization based on explicit CPU, implicit CPU, and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous research, there has been neither a solid quantification of the associated overhead nor guidance on when to use each approach. We therefore quantify the synchronization overhead as a function of the number of kernel launches and the input data size. This quantification, in turn, indicates when each of the aforementioned synchronization mechanisms should be used in a target application. Our results show that implicit CPU synchronization incurs significant overhead that hurts application performance for medium to large data sizes combined with a relatively large number of kernel launches (i.e., ~1100-5000); explicit CPU synchronization is recommended for these configurations. Among the three approaches, dynamic parallelism is the most efficient for small data sizes (i.e., ~128 KB), regardless of the number of kernel launches. Furthermore, because DP performs inter-block (i.e., global) synchronization implicitly, with no CPU intervention, it significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces power consumption by ~8-10%, at the cost of a ~2-5% loss in performance.
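As a rough illustration of the three mechanisms compared above, the sketch below (not taken from the paper; kernel names, sizes, and iteration counts are illustrative assumptions, and the mapping of "explicit"/"implicit" CPU synchronization to specific CUDA calls is our reading rather than the paper's exact setup) contrasts host-side synchronization after each kernel launch, in-order stream launches with a single host wait, and a dynamic-parallelism driver kernel that launches and waits on child grids entirely on the device. It targets the CDP model available around 2016 (compile with nvcc -rdc=true -arch=sm_35 -lcudadevrt); device-side cudaDeviceSynchronize() has since been removed in CUDA 12.

#include <cuda_runtime.h>

__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // placeholder per-iteration work
}

// Dynamic parallelism: one parent thread launches each iteration's child
// grid and waits for it on the device, so no CPU round trip or PCIe
// traffic is needed for global (inter-block) synchronization.
__global__ void dp_driver(float *data, int n, int iters) {
    for (int it = 0; it < iters; ++it) {
        step<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();         // device-side wait (pre-CUDA-12 CDP)
    }
}

int main() {
    const int n = 1 << 15, iters = 1000;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Explicit CPU synchronization: the host blocks after every launch.
    for (int it = 0; it < iters; ++it) {
        step<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();         // host-side global barrier
    }

    // Implicit synchronization via stream ordering: back-to-back launches
    // in the same stream are serialized by the driver; the host waits once.
    for (int it = 0; it < iters; ++it)
        step<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Dynamic parallelism: a single launch; the iteration loop runs on the GPU.
    dp_driver<<<1, 1>>>(d, n, iters);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}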
Year
2016
DOI
10.1109/MASCOTS.2016.58
Venue
2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
Keywords
GPU, CPU Synchronization, Dynamic Parallelism
Field
Kernel (linear algebra), Byte, Central processing unit, Synchronization, Small data, Computer science, Instruction set, Data synchronization, Parallel computing, Real-time computing, PCI Express
DocType
Conference
ISSN
1526-7539
ISBN
978-1-5090-3433-8
Citations
0
PageRank
0.34
References
12
Authors
2
Name, Order, Citations, PageRank
Islam Harb, 1, 0, 0.34
Wu-chun Feng, 2, 2812, 232.50