Abstract
---
In this paper, we introduce APCNN, which explores algorithm-hardware co-design and provides a CNN acceleration framework with multi-layer cooperative optimization and customized design on FPGA. In terms of the algorithm design, the pooling layer is moved before the non-linear activation function and normalization in APCNN, which we prove causes negligible accuracy loss; the pooling layer is then co-optimized with the convolutional layer by means of redundant multiplication elimination, local addition reuse, and global addition reuse. We further design a dedicated accelerator to take full advantage of convolutional-pooling cross-layer optimization to not only accelerate computation but also reduce on-off chip data communication on FPGA. We demonstrate that our novel APCNN can achieve 75% multiplication and 75% addition reduction in the best case. For on-off chip data communication, a max{Row, Col} / (Row × Col) fraction of the memory footprint can be eliminated, where Row and Col are the number of rows and columns in the activation feature map, respectively. We have implemented a prototype of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level cycle and energy model and an RTL implementation. Our experimental results show that APCNN achieves a 2.5× speedup and 4.7× energy efficiency compared with the dense CNN. (This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
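The reordering described in the abstract exploits the fact that max pooling commutes with any monotonically non-decreasing activation such as ReLU (the claim is exact for ReLU + max pooling; with normalization or other pooling variants the paper argues the accuracy loss is negligible). A minimal NumPy sketch of that identity, with hypothetical helper names (`relu`, `max_pool2x2` are not from the paper):

```python
import numpy as np

def relu(x):
    # Elementwise ReLU activation.
    return np.maximum(x, 0.0)

def max_pool2x2(fmap):
    # Non-overlapping 2x2 max pooling on an (H, W) feature map.
    h, w = fmap.shape
    return fmap[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8))  # stand-in for a convolution output

# Conventional order: activation first, then pooling.
baseline = max_pool2x2(relu(fmap))
# APCNN order: pooling first, then activation.
reordered = relu(max_pool2x2(fmap))

# Identical for ReLU, because relu(max(a, b)) == max(relu(a), relu(b)).
assert np.allclose(baseline, reordered)
```

Pooling first shrinks the feature map before the activation (and, in the hardware design, before some of the convolution arithmetic), which is what enables the multiplication/addition reductions the abstract reports.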
Year | DOI | Venue
---|---|---
2021 | 10.1145/3431920.3439461 | International Symposium on Field Programmable Gate Arrays
DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 7
Name | Order | Citations | PageRank |
---|---|---|---
Beilei Jiang | 1 | 6 | 1.99 |
Xianwei Cheng | 2 | 6 | 3.00 |
Sihai Tang | 3 | 16 | 4.54 |
Xu Ma | 4 | 0 | 2.37 |
Zhaochen Gu | 5 | 0 | 1.35 |
Hui Zhao | 6 | 113 | 11.73 |
Song Fu | 7 | 448 | 35.66 |