Title
GPNPU: Enabling Efficient Hardware-Based Direct Convolution with Multi-Precision Support in GPU Tensor Cores
Abstract
To accommodate DNN (Deep Neural Network) acceleration, GPUs have migrated to new architectures such as NVIDIA Volta and Turing that incorporate dedicated Tensor Cores. Although efficient at GEMM (general matrix-matrix multiplication), Tensor Cores remain inefficient for convolutions with certain layer structures. This paper proposes a GPNPU (General-Purpose Neural-network Processing Unit) architecture, which offers an alternative path of direct convolution on the GPU. It stitches the direct-convolution dataflow into the Tensor Cores with little additional hardware, and relies on a regulated data layout with stripe-mined convolution execution to achieve higher performance and power efficiency, while retaining the general programmability of a GPU. We further apply a unified core design that supports varied operand types and precisions for higher computing throughput. The evaluation shows that GPNPU can outperform Tensor Cores on typical DNNs by 1.4X for inference (FP16) and 1.2X for training with much reduced power; the INT8 speedup even reaches 2.4X. Our study demonstrates that it is possible and appealing to refine Tensor Cores for greater DNN acceleration, while conforming to the GPU architecture for the programmability needed in future DNN evolution.
Year
2020
DOI
10.1109/DAC18072.2020.9218566
Venue
2020 57th ACM/IEEE Design Automation Conference (DAC)
DocType
Conference
ISSN
0738-100X
ISBN
978-1-7281-1085-1
Citations
0
PageRank
0.34
References
0
Authors
7
Name            Order  Citations  PageRank
Zhuoran Song    1      9          4.39
Jianfei Wang    2      4          3.19
Li Tianjian     3      9          5.53
Li Jiang        4      286        31.86
Jing Ke         5      3          2.75
Xiaoyao Liang   6      585        45.81
Naifeng Jing    7      152        27.07