Title
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
Abstract
With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing Deep Neural Network (DNN) inference is still challenging given the high computation and storage demands, especially when real-time performance with high accuracy is required. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes of the design space: non-structured pruning is fine-grained and accurate but not hardware friendly; structured pruning is coarse-grained and hardware-efficient but incurs higher accuracy loss. In this paper, we advance the state of the art by introducing a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds and is desirable across the theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework for efficient DNN execution on mobile devices, built on a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough, architecture-aware compiler/code-generation optimizations: filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning. Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
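To make the core idea concrete, the sketch below shows what pattern-based pruning looks like in isolation: every 3x3 convolution kernel is projected onto one mask drawn from a small, fixed pattern library, keeping 4 of 9 weights. This is a minimal NumPy illustration under stated assumptions; the pattern library, the magnitude-based selection rule, and the names (PATTERNS, prune_kernel, prune_conv_layer) are illustrative inventions, not the paper's ADMM-based training procedure or released code.

```python
# Minimal sketch of pattern-based pruning (illustrative, not PatDNN's code).
# Assumption: 3x3 kernels, each pruned to a 4-weight pattern from a small
# fixed library -- the "fine-grained pattern inside a coarse-grained
# structure" idea described in the abstract.
import numpy as np

# Toy pattern library: boolean 3x3 masks, each permitting 4 nonzero positions.
# The actual library in the paper is derived differently; these are examples.
PATTERNS = [
    np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0],
              [1, 1, 0],
              [0, 1, 0]], dtype=bool),
]

def prune_kernel(kernel: np.ndarray) -> np.ndarray:
    """Project one 3x3 kernel onto the pattern preserving the most magnitude."""
    best = max(PATTERNS, key=lambda m: np.abs(kernel[m]).sum())
    return np.where(best, kernel, 0.0)

def prune_conv_layer(weights: np.ndarray) -> np.ndarray:
    """Pattern-prune every kernel of a (out_ch, in_ch, 3, 3) weight tensor."""
    pruned = weights.copy()
    for o in range(weights.shape[0]):
        for i in range(weights.shape[1]):
            pruned[o, i] = prune_kernel(weights[o, i])
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4, 3, 3))
    wp = prune_conv_layer(w)
    print("sparsity:", 1.0 - np.count_nonzero(wp) / wp.size)  # 5/9 of weights zeroed
```

Note the design point this enables: in PatDNN the pattern assignment is learned jointly with the weights via the extended ADMM framework rather than chosen greedily as above, and because all kernels share a handful of known patterns, the compiler can exploit that regularity (e.g., filter kernel reordering and compressed weight storage) to recover the hardware efficiency that non-structured sparsity loses.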
Year
2020
DOI
10.1145/3373376.3378534
Venue
ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 2020
Keywords
Deep Neural Network, Model Compression, Compiler Optimization, Mobile Devices
DocType
Conference
ISBN
978-1-4503-7102-5
Citations
13
PageRank
0.57
References
30
Authors
8
Name | Order | Citations | PageRank
Wei Niu | 1 | 24 | 11.21
Xiaolong Ma | 2 | 22 | 5.90
Sheng Lin | 3 | 139 | 14.39
Shihao Wang | 4 | 62 | 13.33
Xuehai Qian | 5 | 320 | 27.71
Xue Lin | 6 | 86 | 14.97
Yanzhi Wang | 7 | 1082 | 136.11
Bin Ren | 8 | 82 | 18.03