Title |
---|
Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs |
Abstract |
---|
We present a CPU server with multiple FPGAs that is purely software-programmable via a unified framework, enabling flexible implementation of modern, complex, real-life AI that scales to large model sizes (100M+ parameters) while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT model with bi-directional LSTMs, attention, and beam search. Our system scales well: going from 1 to 8 FPGAs allows hosting a ~8× larger model with only a ~2× latency increase. A batch-1 inference for a 100M-parameter NMT model on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip. |
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/ICFPT47387.2019.00054 | 2019 International Conference on Field-Programmable Technology (ICFPT) |
Keywords | Field | DocType |
---|---|---|
AI, multi-FPGA server, neural machine translation | Stratix, Computer science, Inference, Latency (engineering), Machine translation, Parallel computing, Beam search, Field-programmable gate array, Scalability | Conference |
ISBN | Citations | PageRank |
---|---|---|
978-1-7281-2944-0 | 0 | 0.34 |
References | Authors |
---|---|
2 | 10 |
Name | Order | Citations | PageRank |
---|---|---|---|
Eriko Nurvitadhi | 1 | 399 | 33.08 |
Mishali Naik | 2 | 0 | 0.34 |
Andrew Boutros | 3 | 8 | 3.02 |
Prerna Budhkar | 4 | 0 | 0.34 |
Ali Jafari | 5 | 43 | 7.04 |
Dongup Kwon | 6 | 25 | 4.92 |
David Sheffield | 7 | 33 | 3.54 |
Abirami Prabhakaran | 8 | 0 | 0.34 |
Karthik Gururaj | 9 | 0 | 0.34 |
Pranavi Appana | 10 | 0 | 0.34 |