E²bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services - Citegraph

Paper Info

Title
E²bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

Abstract
We aim to tackle existing problems about deep learning serving on GPUs in the view of the system. GPUs have been widely adopted to serve online deep learning-based services that have stringent QoS (Quality-of-Service) requirements. However, emerging deep learning serving systems often result in poor responsiveness and low throughput of the inferences that damage user experience and increase the number of GPUs required to host an online service. Our investigation shows that the poor batching operation and the lack of data transfer-computation overlap are the root causes of the poor responsiveness and low throughput. To this end, we propose E²bird, a deep learning serving system that is comprised of a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates the unnecessary waiting of the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving the GPU resource utilization. The batch scheduler organizes inferences elastically to guarantee the QoS. Our experimental results on an Nvidia Titan RTX GPU show that E²bird reduces the response latency of inferences by up to 82.4 percent and improves the throughput by up to 62.8 percent while guaranteeing the QoS target compared with TensorFlow Serving.

Year	2021
DOI	10.1109/TPDS.2020.3047638
Venue	IEEE Transactions on Parallel and Distributed Systems
Keywords	GPUs, DL serving, latency, throughput, responsiveness
DocType	Journal
Volume	32
Issue	6
ISSN	1045-9219
Citations	3
PageRank	0.37
References	0
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Weihao Cui	1	13	3.27
Quan Chen	2	175	21.86
Han Zhao	3	8	1.81
Mengze Wei	4	3	0.37
Xiaoxin Tang	5	6	0.79
Minyi Guo	6	3969	332.25
