Title
PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences
Abstract
In emerging DNN serving systems, queries are usually batched to fully utilize hardware resources, and all queries in a batch run through the complete model and return at the same time. Our findings show that some queries only need to pass through a portion of the DNN model to attain sufficient precision for a DNN service; these queries could have shorter latencies if they were allowed to return early from the middle of the model. We therefore propose PAME, a precision-aware multi-exit inference serving system that achieves this goal. PAME provides a holistic scheme for building a multi-exit DNN model together with a corresponding system-level design of the inference engine. We evaluate PAME with representative CV and NLP benchmarks, and it adapts to various DNN tasks and service loads. Experimental results show that PAME reduces average latency by 39.9% without increasing tail latency, while maintaining on average 99.68% of the precision of the original single-exit DNN models.
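The core mechanism the abstract describes is early-exit inference: intermediate exit heads are attached along the backbone, and a query returns as soon as an exit's confidence clears a threshold. The sketch below is a minimal, generic illustration of that idea, not PAME's actual batching scheme or exit-placement algorithm; the stage/head functions and thresholds are hypothetical stand-ins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_exit_infer(x, stages, heads, thresholds):
    """Run backbone stages in order; return (predicted_class, exit_index)
    as soon as an exit head's top-class probability clears its threshold.
    Falls through to the final exit if no intermediate exit is confident."""
    h = x
    probs = []
    for i, (stage, head, thr) in enumerate(zip(stages, heads, thresholds)):
        h = stage(h)                 # advance through one backbone stage
        probs = softmax(head(h))     # this exit's class distribution
        if max(probs) >= thr:        # confident enough: exit early
            return probs.index(max(probs)), i
    return probs.index(max(probs)), len(stages) - 1

# Toy two-stage "model": each stage transforms a scalar feature,
# each head maps it to two-class logits (purely illustrative).
stages = [lambda h: h + 1.0, lambda h: h * 2.0]
heads = [lambda h: [h, 0.0]] * 2
```

With a confidence threshold of 0.9, an "easy" input exits at the first head while a "hard" input runs the full backbone, which is exactly how easy queries obtain shorter latencies.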
Year
2022
DOI
10.1145/3524059.3532366
Venue
International Conference on Supercomputing
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name             Order  Citations  PageRank
Shulai Zhang     1      0          0.68
Weihao Cui       2      13         3.27
Quan Chen        3      175        21.86
Zhengnian Zhang  4      0          0.34
Yue Guan         5      1          1.70
Jingwen Leng     6      49         12.97
Chao Li          7      344        37.85
Minyi Guo        8      3969       332.25