Title |
---|
Uni-OPU: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks |
Abstract |
---|
In this article, we design the first full software/hardware stack, called *Uni-OPU*, for efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. Specifically, a software compiler is provided to transform the computation of various TCONV layers, i.e., zero-inserting-based TCONV (zero-TCONV) and nearest-neighbor resizing-based TCONV (NN-TCONV), and CONV layers into the same pattern. The compiler conducts the following optimizations: 1) eliminating up to 98.4% of the operations in TCONV by exploiting the fixed pattern of TCONV upsampling; 2) decomposing and reformulating TCONV and CONV into streaming parallel vector multiplication with a uniform address-generation scheme and data-flow pattern; and 3) efficient scheduling and instruction compilation to map networks onto a hardware processor. An instruction-based hardware acceleration processor is developed to efficiently speed up our uniform computation pattern, with throughput up to 2.35 TOPS for TCONV layers while consuming only 2.89 W of dynamic power. We evaluate *Uni-OPU* on a benchmark set composed of six TCONV networks from different application fields. Extensive experimental results indicate that *Uni-OPU* achieves $1.45\times$ to $3.68\times$ higher power efficiency compared with state-of-the-art zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, whose acceleration has not been explored before. In summary, we observe $1.90\times$ and $1.63\times$ latency reduction, as well as $15.04\times$ and $12.43\times$ higher power efficiency, on zero-TCONV and NN-TCONV networks, respectively, compared with a Titan Xp GPU on average. To the best of our knowledge, ours is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers. |
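To make the unification idea concrete, here is a minimal 1-D sketch (an illustrative assumption, not the paper's actual compiler, address-generation scheme, or hardware data flow): both TCONV variants reduce to "upsample, then run an ordinary CONV", and in the zero-inserting variant a large fraction of the multiplications hit inserted zeros, which is exactly the redundancy a TCONV-aware compiler can skip.

```python
# Illustrative 1-D sketch only; the helper names below are hypothetical
# and do not come from the Uni-OPU source.

def zero_upsample(x, stride=2):
    """Zero-inserting upsampling, as used by zero-TCONV."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0] * (stride - 1))
    # Drop trailing padding so the signal ends on a real sample.
    return out[:-(stride - 1)] if stride > 1 else out

def nn_upsample(x, scale=2):
    """Nearest-neighbor resizing, as used by NN-TCONV."""
    return [v for v in x for _ in range(scale)]

def conv1d(x, w):
    """Plain valid-mode 1-D convolution (cross-correlation form)."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 0.25]

# Both TCONV flavors become the same pattern: upsample, then CONV.
y_zero = conv1d(zero_upsample(x), w)  # zero-TCONV path
y_nn = conv1d(nn_upsample(x), w)      # NN-TCONV path

# Count multiplications whose input operand is an inserted zero; these
# contribute nothing and can be eliminated ahead of time.
ux = zero_upsample(x)
k = len(w)
total = k * (len(ux) - k + 1)
wasted = sum(1 for i in range(len(ux) - k + 1)
             for j in range(k) if ux[i + j] == 0)
print(f"zero-TCONV: {wasted}/{total} multiplications hit an inserted zero")
```

In this toy case nearly half of the multiply operations in the zero-TCONV path touch an inserted zero; for real 2-D layers with stride-2 upsampling the zero fraction is far higher, which is consistent with the abstract's figure of up to 98.4% of operations being eliminable.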
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/TVLSI.2020.2995741 | IEEE Transactions on Very Large Scale Integration (VLSI) Systems |
Keywords | DocType | Volume |
---|---|---|
Artificial neural networks, Hardware, Frequency modulation, Acceleration, Convolution, Field programmable gate arrays, Kernel | Journal | 28 |
Issue | ISSN | Citations |
---|---|---|
7 | 1063-8210 | 4 |
PageRank | References | Authors |
---|---|---|
0.47 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yunxuan Yu | 1 | 14 | 2.52 |
Tiandong Zhao | 2 | 14 | 2.18 |
Mingyu Wang | 3 | 135 | 24.90 |
Kun Wang | 4 | 364 | 30.23 |
Lei He | 5 | 1015 | 86.74 |