Title
Uni-OPU: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks
Abstract
In this article, we design the first full software/hardware stack, called Uni-OPU, for efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. Specifically, a software compiler is provided to transform the computation of various TCONV layers, i.e., zero-insertion-based TCONV (zero-TCONV) and nearest-neighbor-resizing-based TCONV (NN-TCONV), as well as CONV layers, into the same pattern. The compiler conducts the following optimizations: 1) eliminating up to 98.4% of the operations in TCONV by exploiting the fixed pattern of TCONV upsampling; 2) decomposing and reformulating TCONV and CONV into streaming parallel vector multiplication with a uniform address-generation scheme and data-flow pattern; and 3) efficient scheduling and instruction compilation to map networks onto the hardware processor. An instruction-based hardware acceleration processor is developed to efficiently speed up our uniform computation pattern, with throughput of up to 2.35 TOPS for TCONV layers while consuming only 2.89 W of dynamic power. We evaluate Uni-OPU on a benchmark set composed of six TCONV networks from different application fields. Extensive experimental results indicate that Uni-OPU achieves 1.45× to 3.68× higher power efficiency than state-of-the-art zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, whose acceleration has not been explored before. In summary, we observe 1.90× and 1.63× latency reductions, as well as 15.04× and 12.43× higher power efficiency, on zero-TCONV and NN-TCONV networks, respectively, compared with a Titan Xp GPU on average. To the best of our knowledge, ours is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers.
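To make the unification idea concrete, below is a minimal NumPy sketch (not the paper's compiler or hardware, and all function names here are illustrative) showing that both TCONV variants reduce to an upsampling step followed by an ordinary convolution, i.e., the shared pattern such a compiler can target. The zero-insertion variant also shows why so many operations are removable: most operands after upsampling are the inserted zeros.

import numpy as np

def zero_upsample(x, stride):
    """Insert (stride - 1) zeros between pixels (zero-TCONV upsampling)."""
    h, w = x.shape
    up = np.zeros((h * stride, w * stride), dtype=x.dtype)
    up[::stride, ::stride] = x
    return up

def nn_upsample(x, stride):
    """Nearest-neighbor resize (NN-TCONV upsampling)."""
    return np.repeat(np.repeat(x, stride, axis=0), stride, axis=1)

def conv2d(x, k):
    """Plain 'valid' 2-D convolution (cross-correlation form)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
k = np.ones((3, 3), dtype=np.float32)

# Both variants become the same pattern: upsample, then CONV.
y_zero = conv2d(np.pad(zero_upsample(x, 2), 1), k)  # zero-TCONV as CONV
y_nn = conv2d(np.pad(nn_upsample(x, 2), 1), k)      # NN-TCONV as CONV

up = zero_upsample(x, 2)
print("zero fraction after zero-insertion:", 1 - np.count_nonzero(up) / up.size)
# 0.75 in this toy 3x3/stride-2 case; the fraction grows with stride, which is
# why skipping the inserted zeros eliminates the bulk of multiply-accumulates.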
Year
2020
DOI
10.1109/TVLSI.2020.2995741
Venue
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Keywords
Artificial neural networks, Hardware, Frequency modulation, Acceleration, Convolution, Field programmable gate arrays, Kernel
DocType
Journal
Volume
28
Issue
4
ISSN
1063-8210
Citations
7
PageRank
0.47
References
0
Authors
5
Name | Order | Citations | PageRank
Yunxuan Yu | 1 | 14 | 2.52
Tiandong Zhao | 2 | 14 | 2.18
Mingyu Wang | 3 | 135 | 24.90
Kun Wang | 4 | 364 | 30.23
Lei He | 5 | 1015 | 86.74