Title
Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs
Abstract
Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with a more balanced on-chip memory and compute, can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and to provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than a GPU (FP32) and 34× higher energy efficiency. It has 2× the aggregate on-chip memory capacity of a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and that when integrated with an ASIC chiplet, it can offer an even more compelling solution.
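The abstract's TensorRAM figures are internally consistent under one common assumption (ours, not stated in the abstract): in a persistent matrix-vector product, every INT8 weight byte streamed from on-chiplet memory feeds exactly one multiply-accumulate, i.e. 2 ops per byte, so 32 TB/s of bandwidth sustains 64 INT8 TOPS. A minimal back-of-envelope sketch:

```python
def bandwidth_matched_tops(bandwidth_tb_per_s: float,
                           bytes_per_element: int = 1,
                           ops_per_element: int = 2) -> float:
    """Peak TOPS sustainable if every streamed element feeds one MAC.

    Assumes INT8 weights (1 byte/element) and 2 ops (multiply + add)
    per element; these defaults are our assumptions for illustration.
    """
    elements_per_s = bandwidth_tb_per_s * 1e12 / bytes_per_element
    return elements_per_s * ops_per_element / 1e12


# 32 TB/s of INT8 weight streaming sustains 64 TOPS,
# matching the TensorRAM numbers quoted in the abstract.
print(bandwidth_matched_tops(32))  # -> 64.0
```

The same function shows why FP32 persistent inference is bandwidth-bound by comparison: with 4 bytes per element, the same 32 TB/s would sustain only 16 TFLOPS.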
Year
2019
DOI
10.1109/FCCM.2019.00035
Venue
2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Keywords
Field programmable gate arrays, System-on-chip, Graphics processing units, Computational modeling, Real-time systems, Hazards, Throughput
Field
Stratix, System on a chip, Latency (engineering), Computer science, Efficient energy use, Parallel computing, Field-programmable gate array, Application-specific integrated circuit, Bandwidth (signal processing), Throughput, Embedded system
DocType
Conference
ISBN
978-1-7281-1131-5
Citations
6
PageRank
0.60
References
0
Authors
16
Name | Order | Citations | PageRank
Eriko Nurvitadhi | 1 | 399 | 33.08
Dongup Kwon | 2 | 25 | 4.92
Ali Jafari | 3 | 43 | 7.04
Andrew Boutros | 4 | 8 | 3.02
Jaewoong Sim | 5 | 384 | 17.25
Phillip Tomson | 6 | 6 | 0.94
Huseyin Sumbul | 7 | 6 | 2.29
Gregory K. Chen | 8 | 22 | 1.64
Phil Knag | 9 | 6 | 0.60
Raghavan Kumar | 10 | 6 | 0.60
Ram Krishnamurthy | 11 | 650 | 74.63
Sergey Gribok | 12 | 9 | 3.78
Bogdan Pasca | 13 | 325 | 28.69
Martin Langhammer | 14 | 104 | 20.22
Debbie Marr | 15 | 175 | 12.39
Aravind Dasu | 16 | 10 | 4.47