Title
Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs
Abstract
Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with a more balanced on-chip memory and compute, can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and to provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than a GPU (FP32) and 34× higher energy efficiency. It has 2× the aggregate on-chip memory capacity of a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and that when integrated with an ASIC chiplet, it can offer an even more compelling solution.
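The abstract's TensorRAM figures are internally consistent under one common assumption (ours, not stated in the abstract): in a persistent matrix-vector product, every INT8 weight byte streamed from on-chiplet memory feeds exactly one multiply-accumulate, i.e. 2 ops per byte, so 32 TB/s of bandwidth sustains 64 INT8 TOPS. A minimal back-of-envelope sketch:

```python
def bandwidth_matched_tops(bandwidth_tb_per_s: float,
                           bytes_per_element: int = 1,
                           ops_per_element: int = 2) -> float:
    """Peak TOPS sustainable if every streamed element feeds one MAC.

    Assumes INT8 weights (1 byte/element) and 2 ops (multiply + add)
    per element; these defaults are our assumptions for illustration.
    """
    elements_per_s = bandwidth_tb_per_s * 1e12 / bytes_per_element
    return elements_per_s * ops_per_element / 1e12


# 32 TB/s of INT8 weight streaming sustains 64 TOPS,
# matching the TensorRAM numbers quoted in the abstract.
print(bandwidth_matched_tops(32))  # -> 64.0
```

The same function shows why FP32 persistent inference is bandwidth-bound by comparison: with 4 bytes per element, the same 32 TB/s would sustain only 16 TFLOPS.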
Year
2019
DOI
10.1109/FCCM.2019.00035
Venue
2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Keywords
Field programmable gate arrays, System-on-chip, Graphics processing units, Computational modeling, Real-time systems, Hazards, Throughput
Field
Stratix, System on a chip, Latency (engineering), Computer science, Efficient energy use, Parallel computing, Field-programmable gate array, Application-specific integrated circuit, Bandwidth (signal processing), Throughput, Embedded system
DocType
Conference
ISBN
978-1-7281-1131-5
Citations
6
PageRank
0.60
References
0
Authors
16
Name | Order | Citations | PageRank
Eriko Nurvitadhi | 1 | 399 | 33.08
Dongup Kwon | 2 | 25 | 4.92
Ali Jafari | 3 | 43 | 7.04
Andrew Boutros | 4 | 8 | 3.02
Jaewoong Sim | 5 | 384 | 17.25
Phillip Tomson | 6 | 6 | 0.94
Huseyin Sumbul | 7 | 6 | 2.29
Gregory K. Chen | 8 | 22 | 1.64
Phil Knag | 9 | 6 | 0.60
Raghavan Kumar | 10 | 6 | 0.60
Ram Krishnamurthy | 11 | 650 | 74.63
Sergey Gribok | 12 | 9 | 3.78
Bogdan Pasca | 13 | 325 | 28.69
Martin Langhammer | 14 | 104 | 20.22
Debbie Marr | 15 | 175 | 12.39
Aravind Dasu | 16 | 10 | 4.47