Title
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
Abstract
Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not offer the performance of today's GPUs on DNNs. In this paper, we look at upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel® 14-nm Stratix 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks). They will also have high-bandwidth memories (HBMs) and improved frequency (HyperFlex core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are evolving quickly. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bit) deliver major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which is difficult for GPUs to handle but a great fit for the FPGA's extreme customizability. This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria 10, Stratix 10) against the latest, highest-performance Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/sec) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively.
Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM on 2-bit weights (i.e., weights constrained to 0, +1, -1) and full-precision neurons. Ternary ResNet accuracy is within ~1% of the full-precision ResNet that won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA delivers 60% better performance than the Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.
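The key property behind the sparse ternary GEMM described above is that with weights constrained to {-1, 0, +1}, every multiply collapses to an add, a subtract, or a skipped operation. The sketch below illustrates that property in plain Python; it is a hypothetical reference illustration, not the paper's actual accelerator kernel.

```python
def ternary_gemm(W, x):
    """Matrix-vector product where W has ternary entries in {-1, 0, +1}.

    Illustrative sketch only: shows how ternary weights turn multiplies
    into add/subtract/skip, the structure an FPGA datapath can exploit.
    W is a list of rows; x is a list of full-precision activations.
    """
    y = [0.0] * len(W)
    for i, row in enumerate(W):
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj       # +1 weight: accumulate activation
            elif w == -1:
                acc -= xj       # -1 weight: subtract activation
            # 0 weight: no work at all (the sparsity being exploited)
        y[i] = acc
    return y
```

For example, `ternary_gemm([[1, 0, -1], [0, 1, 1]], [2.0, 3.0, 4.0])` performs only adds and subtracts, skipping the zero weights entirely.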
Year: 2017
DOI: 10.1145/3020078.3021740
Venue: FPGA
Keywords: Deep Learning, Accelerator, Intel Stratix 10 FPGA, GPU
Field: Stratix, Algorithmic efficiency, Efficient energy use, Computer science, Floating point, Parallel computing, Field-programmable gate array, Real-time computing, Data type, Artificial intelligence, Deep learning, Matrix multiplication
DocType: Conference
Citations: 60
PageRank: 2.55
References: 13
Authors: 11
Name | Order | Citations | PageRank
Eriko Nurvitadhi | 1 | 399 | 33.08
Ganesh Venkatesh | 2 | 274 | 17.97
Jaewoong Sim | 3 | 384 | 17.25
Debbie Marr | 4 | 175 | 12.39
Randy Huang | 5 | 292 | 28.48
Jason Ong Gee Hock | 6 | 60 | 2.55
Yeong Tat Liew | 7 | 60 | 2.55
Srivatsan Krishnan | 8 | 96 | 6.86
Duncan J. M. Moss | 9 | 91 | 7.74
Suchit Subhaschandra | 10 | 82 | 5.50
Guy Boudoukh | 11 | 60 | 2.55