Title
TinyBERT: Distilling BERT for Natural Language Understanding
Abstract
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, making them difficult to deploy on resource-constrained devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method specially designed for transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT captures both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT also significantly outperforms state-of-the-art baselines, using only about 28% of their parameters and 31% of their inference time.
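The abstract describes the transformer distillation objective only at a high level. As a rough, non-authoritative sketch of what such a layer-wise objective can look like, the PyTorch snippet below matches the student's attention maps and hidden states to those of mapped teacher layers, with a learned linear projection bridging the width mismatch between the small student and the large teacher. The function name, tensor shapes, and the `proj` module are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of a transformer-layer distillation loss (assumed form):
# the student mimics the teacher's attention matrices and hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

def transformer_layer_distill_loss(
    student_attn,    # list of [batch, heads, seq, seq] student attention maps
    teacher_attn,    # attention maps of the mapped teacher layers
    student_hidden,  # list of [batch, seq, d_student] student hidden states
    teacher_hidden,  # list of [batch, seq, d_teacher] teacher hidden states
    proj,            # nn.Linear(d_student, d_teacher) bridging the width gap
):
    loss = torch.zeros(())
    for a_s, a_t in zip(student_attn, teacher_attn):
        loss = loss + F.mse_loss(a_s, a_t)        # attention-based distillation
    for h_s, h_t in zip(student_hidden, teacher_hidden):
        loss = loss + F.mse_loss(proj(h_s), h_t)  # hidden-state distillation
    return loss

# Toy usage with assumed sizes (student width 312, teacher width 768):
proj = nn.Linear(312, 768)
s_attn = [torch.rand(8, 12, 128, 128) for _ in range(2)]
t_attn = [torch.rand(8, 12, 128, 128) for _ in range(2)]
s_hid = [torch.rand(8, 128, 312) for _ in range(2)]
t_hid = [torch.rand(8, 128, 768) for _ in range(2)]
loss = transformer_layer_distill_loss(s_attn, t_attn, s_hid, t_hid, proj)
```

Under the two-stage framework described in the abstract, a loss of this form would be applied once during general (pre-training) distillation and again during task-specific distillation, so that the student acquires both general-domain and task-specific knowledge from the teacher.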
Year
2020
DOI
10.18653/V1/2020.FINDINGS-EMNLP.372
Venue
EMNLP
DocType
Conference
Volume
2020.findings-emnlp
Citations
2
PageRank
0.35
References
0
Authors
8
Name           Order   Citations   PageRank
Jiao Xiaoqi    1       2           0.35
Yin Yichun     2       2           0.35
Lifeng Shang   3       485         30.96
Xin Jiang      4       150         32.43
Chen Xiao      5       2           0.69
Li Linlin      6       2           0.35
Wang Fang      7       2           0.35
Qun Liu        8       2149        203.11