**Abstract**

This paper proposes Mandheling, the first system that enables highly resource-efficient on-device training by orchestrating mixed-precision training with on-chip Digital Signal Processor (DSP) offloading. Mandheling fully explores the advantages of DSP in integer-based numerical calculations using four novel techniques: (1) a CPU-DSP co-scheduling scheme to situationally mitigate the overhead from DSP-unfriendly operators; (2) a self-adaptive rescaling algorithm to reduce the overhead of dynamic rescaling in backward propagation; (3) a batch-splitting algorithm to improve DSP cache efficiency; (4) a DSP compute subgraph-reusing mechanism to eliminate the preparation overhead on DSP. We have fully implemented Mandheling and demonstrated its effectiveness through extensive experiments. The results show that, compared to the state-of-the-art DNN engines from TFLite and MNN, Mandheling reduces per-batch training time by 5.5X and energy consumption by 8.9X on average. In end-to-end training tasks, Mandheling reduces convergence time by up to 10.7X and energy consumption by 13.1X, with only 1.9%--2.7% accuracy loss compared to the FP32 precision setting.
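To make the abstract's "dynamic rescaling" concrete: integer-based mixed-precision training maps each FP32 tensor to int8 using a scale derived from the tensor's current value range, and gradients in backward propagation change range every step, so the scale must be recomputed repeatedly (the overhead the paper's self-adaptive rescaling targets). The sketch below is an illustrative, generic symmetric int8 quantize/dequantize round-trip, not Mandheling's actual algorithm; the function names are hypothetical.

```python
# Illustrative sketch of per-tensor dynamic rescaling for int8 training.
# NOT Mandheling's implementation: just the generic symmetric quantization
# step whose repeated scale recomputation the paper's self-adaptive
# rescaling aims to avoid.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization with a dynamically computed scale."""
    # Scale maps the tensor's max magnitude onto the int8 range [-127, 127];
    # the floor guards against division by zero for all-zero tensors.
    scale = max(np.abs(x).max() / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation from the int8 tensor and its scale."""
    return q.astype(np.float32) * scale

# Round-trip a synthetic gradient tensor; rounding error is bounded by scale/2.
rng = np.random.default_rng(0)
grad = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(grad)
err = np.abs(dequantize(q, scale) - grad).max()
print(f"max abs quantization error: {err:.4f} (scale={scale:.4f})")
```

Because the scale is recomputed from the live tensor each call, every backward step pays an extra full pass over the gradient; reducing how often that recomputation happens is the point of the self-adaptive rescaling technique.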
Year | DOI | Venue
---|---|---
2022 | 10.1145/3495243.3560545 | Mobile Computing and Networking
DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34
References | Authors
---|---
0 | 9
Name | Order | Citations | PageRank
---|---|---|---
Daliang Xu | 1 | 0 | 0.34 |
Mengwei Xu | 2 | 66 | 8.32 |
Qipeng Wang | 3 | 0 | 0.34 |
Shangguang Wang | 4 | 816 | 88.84 |
Yun Ma | 5 | 216 | 20.25 |
Kang Huang | 6 | 0 | 0.34 |
Gang Huang | 7 | 1223 | 110.80 |
Xin Jin | 8 | 0 | 0.34 |
Xuanzhe Liu | 9 | 689 | 57.53 |