Title
Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs
Abstract
We present a CPU server with multiple FPGAs that is purely software-programmable through a unified framework, enabling flexible implementation of modern, real-life, complex AI models that scale to large sizes (100M+ parameters) while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs, avoiding costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT model with bi-directional LSTMs, attention, and beam search. Our system scales well: going from 1 to 8 FPGAs allows hosting an ~8× larger model with only an ~2× increase in latency. A batch-1 inference for a 100M-parameter NMT model on 8 Stratix 10 FPGAs takes only ~10 ms. This system achieves 110× lower latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.
Year
2019
DOI
10.1109/ICFPT47387.2019.00054
Venue
2019 International Conference on Field-Programmable Technology (ICFPT)
Keywords
AI, multi-FPGA server, neural machine translation
Field
Stratix, Computer science, Inference, Latency (engineering), Machine translation, Parallel computing, Beam search, Field-programmable gate array, Scalability
DocType
Conference
ISBN
978-1-7281-2944-0
Citations
0
PageRank
0.34
References
2
Authors
10