Title
PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences
Abstract
In emerging DNN serving systems, queries are usually batched to fully utilize hardware resources, and all queries in a batch run through the complete model and return at the same time. Our findings show that some queries only need to pass through a portion of the DNN model to attain sufficient precision for a DNN service; these queries could have shorter latencies if they were allowed to return early from the middle of the model. We therefore propose PAME, a precision-aware multi-exit inference serving system that achieves this goal. PAME provides a holistic scheme for building a multi-exit DNN model together with a corresponding system-level design of the inference engine. We evaluate PAME with representative CV and NLP benchmarks, and it adapts to various DNN tasks and service loads. Experimental results show that PAME reduces average latency by 39.9% without increasing tail latency, while maintaining on average 99.68% of the precision of the original single-exit DNN models.
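The core mechanism the abstract describes is early-exit inference: intermediate exit heads are attached along the backbone, and a query returns as soon as an exit's confidence clears a threshold. The sketch below is a minimal, generic illustration of that idea, not PAME's actual batching scheme or exit-placement algorithm; the stage/head functions and thresholds are hypothetical stand-ins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_exit_infer(x, stages, heads, thresholds):
    """Run backbone stages in order; return (predicted_class, exit_index)
    as soon as an exit head's top-class probability clears its threshold.
    Falls through to the final exit if no intermediate exit is confident."""
    h = x
    probs = []
    for i, (stage, head, thr) in enumerate(zip(stages, heads, thresholds)):
        h = stage(h)                 # advance through one backbone stage
        probs = softmax(head(h))     # this exit's class distribution
        if max(probs) >= thr:        # confident enough: exit early
            return probs.index(max(probs)), i
    return probs.index(max(probs)), len(stages) - 1

# Toy two-stage "model": each stage transforms a scalar feature,
# each head maps it to two-class logits (purely illustrative).
stages = [lambda h: h + 1.0, lambda h: h * 2.0]
heads = [lambda h: [h, 0.0]] * 2
```

With a confidence threshold of 0.9, an "easy" input exits at the first head while a "hard" input runs the full backbone, which is exactly how easy queries obtain shorter latencies.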
Year
2022
DOI
10.1145/3524059.3532366
Venue
International Conference on Supercomputing
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name             Order  Citations  PageRank
Shulai Zhang     1      0          0.68
Weihao Cui       2      13         3.27
Quan Chen        3      175        21.86
Zhengnian Zhang  4      0          0.34
Yue Guan         5      1          1.70
Jingwen Leng     6      49         12.97
Chao Li          7      344        37.85
Minyi Guo        8      3969       332.25