Title
High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression
Abstract
The growing interest in using FPGAs to accelerate convolutional neural network (CNN) workloads is driving the deployment of FPGAs on cloud services such as Amazon AWS and Microsoft Azure. However, current cloud-based FPGAs face a serious data transfer bandwidth bottleneck. In this paper, we compress the transferred images using customized JPEG coding and implement a customized image decoder architecture. We analyze the trade-off between data transfer speed-up and recognition accuracy drop. Based on this compression scheme, we design a high-throughput CNN inference engine. Almost all existing FPGA-based CNN accelerators follow the same idea as their GPU counterparts, where operations from different network layers are mapped onto the same hardware units working in a multiplexed way. In contrast, our fully pipelined architecture maps all the network layers on-chip and assigns the computation of each layer to its own unit, which is optimized independently. We apply two CNN optimization techniques to a residual network: a channel-shift and point-wise approximation, and binary weight quantization. We implement the proposed CNN inference accelerator on a Xilinx Virtex UltraScale+ XCVU9P FPGA. Our system achieves a peak performance of 2.41 TOPS. The compressed JPEG image transfer consumes only 4% of the system resources, drops accuracy by only 0.3 points, and achieves 81,120 FPS, which is 65.27 times faster than conventional straightforward RGB data transfer. Thus, our proposed data transfer architecture is sufficient to sustain system performance. Our system throughput is 3.84 to 34.41 times higher than that of existing FPGA implementations. Compared with a Xeon CPU, our accelerator achieves 138.38 times higher throughput and dissipates 1.2 times less power, making it 177.12 times more power-efficient. Compared with a Tesla V100 GPU, it achieves 9.48 times higher throughput and dissipates 3.9 times less power, making it 37.52 times more power-efficient. Thus, our parallel architecture on an FPGA provides superior throughput for CNN acceleration.
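The abstract names binary weight quantization as one of the two network optimizations applied to the residual network. As a rough illustration only, below is a minimal NumPy sketch of one common variant (XNOR-Net-style per-output-channel scaling); the function name and the mean-absolute-value scaling rule are assumptions for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def binarize_weights(w: np.ndarray) -> np.ndarray:
    """Binarize a conv weight tensor of shape (out_ch, in_ch, kh, kw)
    to {-alpha, +alpha}, with one alpha per output channel.
    (Illustrative sketch; the paper's quantizer may differ.)"""
    # Per-output-channel scaling factor: mean absolute value of the weights.
    alpha = np.mean(np.abs(w), axis=(1, 2, 3), keepdims=True)
    # Keep only the sign of each weight, scaled by alpha.
    return alpha * np.sign(w)

# Example: a hypothetical 3x3 convolution layer with 64 in/out channels.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64, 3, 3)).astype(np.float32)
wb = binarize_weights(w)
print(np.unique(np.abs(wb[0])).size)  # 1: a single magnitude per output channel
```

With weights restricted to two values per channel, the FPGA multipliers reduce to sign flips plus one shared scaling, which is what makes a fully on-chip, per-layer-pipelined mapping feasible.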
Year
2020
DOI
10.1109/FCCM48280.2020.00010
Venue
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Keywords
high-throughput convolutional neural network, cloud services, Amazon AWS, Microsoft Azure, data transfer bandwidth, network layers on-chip, point-wise approximation, binary weight quantization, CNN inference accelerator, JPEG compression, cloud-based FPGA, FPGA-based CNN accelerator, transfer image compression, JPEG coding, pipelined architecture map, CNN optimization technique, channel shift, Xilinx Virtex UltraScale+ XCVU9P FPGA, RGB data transfer
DocType
Conference
ISSN
2576-2613
ISBN
978-1-7281-5804-4
Citations
1
PageRank
0.40
References
14
Authors
3
Name             Order  Citations  PageRank
Hiroki Nakahara  1      155        37.34
Zhiqiang Que     2      26         9.81
Wayne Luk        3      3752       438.09