Title
MixKD: Towards Efficient Distillation of Large-scale Language Models
Abstract
Large-scale language models have demonstrated impressive empirical performance in recent years. Nevertheless, the improved results are attained at the price of larger model size, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such large models. However, large-scale neural network systems are prone to memorizing training instances, and thus tend to make inconsistent predictions when the data distribution is slightly altered. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages Mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behaviour on linear interpolations of example pairs as well. We prove, from a theoretical perspective, that MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct extensive experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over standard KD training and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
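To make the mechanism described in the abstract concrete, below is a minimal sketch (not the authors' released code) of a Mixup-based distillation loss: a pair of examples is linearly interpolated, assumed here to be in embedding space, and the student is trained to match the teacher's softened predictions on the interpolated input alongside a mixed supervised loss. The student/teacher callables, the Beta(alpha, alpha) sampling, the temperature, and the equal weighting of the two terms are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def mixkd_style_loss(student, teacher, emb_a, emb_b, labels_a, labels_b,
                         alpha=0.4, temperature=2.0):
        # Sample the Mixup interpolation coefficient from a Beta distribution.
        lam = torch.distributions.Beta(alpha, alpha).sample().item()

        # Linearly interpolate a pair of examples (assumed here to be embedded
        # inputs of the same shape).
        emb_mix = lam * emb_a + (1.0 - lam) * emb_b

        # Teacher's behaviour on the interpolated input (no gradient needed).
        with torch.no_grad():
            teacher_logits = teacher(emb_mix)
        student_logits = student(emb_mix)

        # Distillation term: student mimics the teacher's softened predictions.
        kd = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

        # Supervised term on the mixed labels, weighted by the same coefficient.
        ce = lam * F.cross_entropy(student_logits, labels_a) + \
             (1.0 - lam) * F.cross_entropy(student_logits, labels_b)

        return kd + ce

Consistent with the abstract, such a term would be used in addition to the usual KD objective on the original (un-mixed) training examples; the exact interpolation site and loss weights follow the paper itself.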
Year
2021
Venue
ICLR
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
7
Name              Order    Citations    PageRank
Kevin J. Liang    1        2            4.42
Weituo Hao        2        0            1.01
Dinghan Shen      3        108          10.37
Yufan Zhou        4        0            1.69
Weizhu Chen       5        597          38.77
Changyou Chen     6        365          36.95
L. Carin          7        4603         339.36