Title |
---|
Uni-OPU: An FPGA-Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks |
Abstract |
---|
In this article, we design the first full software/hardware stack, called *Uni-OPU*, for efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. Specifically, a software compiler is provided to transform the computation of various TCONV layers, i.e., zero-inserting-based TCONV (zero-TCONV) and nearest-neighbor resizing-based TCONV (NN-TCONV), and CONV layers into the same pattern. The compiler conducts the following optimizations: 1) eliminating up to 98.4% of the operations in TCONV by exploiting the fixed pattern of TCONV upsampling; 2) decomposing and reformulating TCONV and CONV into streaming parallel vector multiplication with a uniform address-generation scheme and data-flow pattern; and 3) efficient scheduling and instruction compilation to map networks onto a hardware processor. An instruction-based hardware acceleration processor is developed to efficiently speed up our uniform computation pattern, with throughput up to 2.35 TOPS for TCONV layers while consuming only 2.89 W of dynamic power. We evaluate *Uni-OPU* on a benchmark set composed of six TCONV networks from different application fields. Extensive experimental results indicate that *Uni-OPU* achieves $1.45\times$ to $3.68\times$ higher power efficiency compared with state-of-the-art zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, whose acceleration has not been explored before. In summary, we observe $1.90\times$ and $1.63\times$ latency reduction, as well as $15.04\times$ and $12.43\times$ higher power efficiency, on zero-TCONV and NN-TCONV networks, respectively, compared with a Titan Xp GPU on average. To the best of our knowledge, ours is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers. |
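To make the unification idea concrete, here is a minimal 1-D sketch (an illustrative assumption, not the paper's actual compiler, address-generation scheme, or hardware data flow): both TCONV variants reduce to "upsample, then run an ordinary CONV", and in the zero-inserting variant a large fraction of the multiplications hit inserted zeros, which is exactly the redundancy a TCONV-aware compiler can skip.

```python
# Illustrative 1-D sketch only; the helper names below are hypothetical
# and do not come from the Uni-OPU source.

def zero_upsample(x, stride=2):
    """Zero-inserting upsampling, as used by zero-TCONV."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0] * (stride - 1))
    # Drop trailing padding so the signal ends on a real sample.
    return out[:-(stride - 1)] if stride > 1 else out

def nn_upsample(x, scale=2):
    """Nearest-neighbor resizing, as used by NN-TCONV."""
    return [v for v in x for _ in range(scale)]

def conv1d(x, w):
    """Plain valid-mode 1-D convolution (cross-correlation form)."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 0.25]

# Both TCONV flavors become the same pattern: upsample, then CONV.
y_zero = conv1d(zero_upsample(x), w)  # zero-TCONV path
y_nn = conv1d(nn_upsample(x), w)      # NN-TCONV path

# Count multiplications whose input operand is an inserted zero; these
# contribute nothing and can be eliminated ahead of time.
ux = zero_upsample(x)
k = len(w)
total = k * (len(ux) - k + 1)
wasted = sum(1 for i in range(len(ux) - k + 1)
             for j in range(k) if ux[i + j] == 0)
print(f"zero-TCONV: {wasted}/{total} multiplications hit an inserted zero")
```

In this toy case nearly half of the multiply operations in the zero-TCONV path touch an inserted zero; for real 2-D layers with stride-2 upsampling the zero fraction is far higher, which is consistent with the abstract's figure of up to 98.4% of operations being eliminable.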
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/TVLSI.2020.2995741 | IEEE Transactions on Very Large Scale Integration (VLSI) Systems |
Keywords | DocType | Volume |
---|---|---|
Artificial neural networks, Hardware, Frequency modulation, Acceleration, Convolution, Field programmable gate arrays, Kernel | Journal | 28 |
Issue | ISSN | Citations |
---|---|---|
7 | 1063-8210 | 4 |
PageRank | References | Authors |
---|---|---|
0.47 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yunxuan Yu | 1 | 14 | 2.52 |
Tiandong Zhao | 2 | 14 | 2.18 |
Mingyu Wang | 3 | 135 | 24.90 |
Kun Wang | 4 | 364 | 30.23 |
Lei He | 5 | 1015 | 86.74 |