Title
TinyBERT: Distilling BERT for Natural Language Understanding
Abstract
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, making them difficult to deploy on resource-constrained devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method specially designed for transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT captures both the general-domain and task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT also significantly outperforms state-of-the-art baselines, using only about 28% of their parameters and 31% of their inference time.
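The abstract describes the transformer distillation objective only at a high level. As a rough, non-authoritative sketch of what such a layer-wise objective can look like, the PyTorch snippet below matches the student's attention maps and hidden states to those of mapped teacher layers, with a learned linear projection bridging the width mismatch between the small student and the large teacher. The function name, tensor shapes, and the `proj` module are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of a transformer-layer distillation loss (assumed form):
# the student mimics the teacher's attention matrices and hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

def transformer_layer_distill_loss(
    student_attn,    # list of [batch, heads, seq, seq] student attention maps
    teacher_attn,    # attention maps of the mapped teacher layers
    student_hidden,  # list of [batch, seq, d_student] student hidden states
    teacher_hidden,  # list of [batch, seq, d_teacher] teacher hidden states
    proj,            # nn.Linear(d_student, d_teacher) bridging the width gap
):
    loss = torch.zeros(())
    for a_s, a_t in zip(student_attn, teacher_attn):
        loss = loss + F.mse_loss(a_s, a_t)        # attention-based distillation
    for h_s, h_t in zip(student_hidden, teacher_hidden):
        loss = loss + F.mse_loss(proj(h_s), h_t)  # hidden-state distillation
    return loss

# Toy usage with assumed sizes (student width 312, teacher width 768):
proj = nn.Linear(312, 768)
s_attn = [torch.rand(8, 12, 128, 128) for _ in range(2)]
t_attn = [torch.rand(8, 12, 128, 128) for _ in range(2)]
s_hid = [torch.rand(8, 128, 312) for _ in range(2)]
t_hid = [torch.rand(8, 128, 768) for _ in range(2)]
loss = transformer_layer_distill_loss(s_attn, t_attn, s_hid, t_hid, proj)
```

Under the two-stage framework described in the abstract, a loss of this form would be applied once during general (pre-training) distillation and again during task-specific distillation, so that the student acquires both general-domain and task-specific knowledge from the teacher.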
Year
2020
DOI
10.18653/V1/2020.FINDINGS-EMNLP.372
Venue
EMNLP
DocType
Conference
Volume
2020.findings-emnlp
Citations
2
PageRank
0.35
References
0
Authors
8
Name           Order   Citations   PageRank
Jiao Xiaoqi    1       2           0.35
Yin Yichun     2       2           0.35
Lifeng Shang   3       485         30.96
Xin Jiang      4       150         32.43
Chen Xiao      5       2           0.69
Li Linlin      6       2           0.35
Wang Fang      7       2           0.35
Qun Liu        8       2149        203.11