Abstract |
---|
Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models), especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and the engineering effort required to set up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection ensures that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. We also show that it acts as a regularizer, improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets, and show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training time. |
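The abstract describes the core idea: at each training step, a random subset of teacher layers is selected and distilled into the student's intermediate layers. Below is a minimal sketch of that layer-wise idea, assuming PyTorch; the names (`teacher_hiddens`, `student_hiddens`, `proj`) and the normalized-MSE objective are illustrative assumptions, not the paper's exact implementation.

```python
import random
import torch
import torch.nn.functional as F

def rail_kd_loss(teacher_hiddens, student_hiddens, proj):
    """Distill k randomly chosen teacher layers into the k student layers.

    teacher_hiddens: list of [batch, hidden_t] pooled teacher layer outputs
    student_hiddens: list of [batch, hidden_s] pooled student layer outputs
    proj: nn.Linear mapping the teacher hidden size to the student hidden size
    (all names are hypothetical, for illustration only)
    """
    k = len(student_hiddens)
    # Randomly pick k of the teacher's intermediate layers, kept in order so
    # earlier teacher layers align with earlier student layers.
    chosen = sorted(random.sample(range(len(teacher_hiddens)), k))
    loss = 0.0
    for t_idx, s_h in zip(chosen, student_hiddens):
        t_h = proj(teacher_hiddens[t_idx]).detach()  # teacher side is frozen
        # Match normalized representations with an MSE objective (one possible
        # choice of intermediate-layer distillation loss).
        loss = loss + F.mse_loss(F.normalize(s_h, dim=-1),
                                 F.normalize(t_h, dim=-1))
    return loss / k
```

Because the random selection changes every step, all teacher layers eventually contribute to training, while only k layer pairs are compared per step, which is the source of the computational savings and the regularization effect claimed in the abstract.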
Year | DOI | Venue |
---|---|---|
2022 | 10.18653/v1/2022.findings-naacl.103 | The Annual Conference of the North American Chapter of the Association for Computational Linguistics |
DocType | Citations | PageRank
---|---|---|
Conference | 0 | 0.34
References | Authors
---|---|
0 | 6
Name | Order | Citations | PageRank |
---|---|---|---|
Md Akmal Haidar | 1 | 0 | 0.68 |
Nithin Anchuri | 2 | 0 | 0.34 |
Mehdi Rezagholizadeh | 3 | 3 | 8.82 |
Abbas Ghaddar | 4 | 0 | 0.68 |
Philippe Langlais | 5 | 0 | 0.34 |
Pascal Poupart | 6 | 0 | 0.34 |