Title
Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels
Abstract
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) of a GPU adversely impacts its ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization based on explicit CPU, implicit CPU, and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous research, there has been neither a solid quantification of the associated overhead nor guidance on when to use each approach. We therefore quantify the synchronization overhead as a function of the number of kernel launches and the input data size. This quantification, in turn, indicates when each of the aforementioned synchronization mechanisms should be used in a target application. Our results show that implicit CPU synchronization incurs significant overhead that hurts application performance for medium to large data sizes combined with a relatively large number of kernel launches (i.e., ~1100-5000); explicit CPU synchronization is recommended for these configurations. Among the three approaches, dynamic parallelism is the most efficient for small data sizes (i.e., ~128 KB), regardless of the number of kernel launches. Furthermore, because DP performs inter-block (i.e., global) synchronization implicitly, with no CPU intervention, it significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces power consumption by ~8-10%, at the cost of a ~2-5% loss in performance.
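As a rough illustration of the three mechanisms compared above, the sketch below (not taken from the paper; kernel names, sizes, and iteration counts are illustrative assumptions, and the mapping of "explicit"/"implicit" CPU synchronization to specific CUDA calls is our reading rather than the paper's exact setup) contrasts host-side synchronization after each kernel launch, in-order stream launches with a single host wait, and a dynamic-parallelism driver kernel that launches and waits on child grids entirely on the device. It targets the CDP model available around 2016 (compile with nvcc -rdc=true -arch=sm_35 -lcudadevrt); device-side cudaDeviceSynchronize() has since been removed in CUDA 12.

#include <cuda_runtime.h>

__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // placeholder per-iteration work
}

// Dynamic parallelism: one parent thread launches each iteration's child
// grid and waits for it on the device, so no CPU round trip or PCIe
// traffic is needed for global (inter-block) synchronization.
__global__ void dp_driver(float *data, int n, int iters) {
    for (int it = 0; it < iters; ++it) {
        step<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();         // device-side wait (pre-CUDA-12 CDP)
    }
}

int main() {
    const int n = 1 << 15, iters = 1000;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Explicit CPU synchronization: the host blocks after every launch.
    for (int it = 0; it < iters; ++it) {
        step<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();         // host-side global barrier
    }

    // Implicit synchronization via stream ordering: back-to-back launches
    // in the same stream are serialized by the driver; the host waits once.
    for (int it = 0; it < iters; ++it)
        step<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Dynamic parallelism: a single launch; the iteration loop runs on the GPU.
    dp_driver<<<1, 1>>>(d, n, iters);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}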
Year
2016
DOI
10.1109/MASCOTS.2016.58
Venue
2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
Keywords
GPU, CPU Synchronization, Dynamic Parallelism
Field
Kernel (linear algebra), Byte, Central processing unit, Synchronization, Small data, Computer science, Instruction set, Data synchronization, Parallel computing, Real-time computing, PCI Express
DocType
Conference
ISSN
1526-7539
ISBN
978-1-5090-3433-8
Citations
0
PageRank
0.34
References
12
Authors
2
Name, Order, Citations, PageRank
Islam Harb, 1, 0, 0.34
Wu-chun Feng, 2, 2812, 232.50