Abstract
---
Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation-based knowledge distillation approach to compressing the deep Transformer model into a shallow model. The experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8 times shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
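The abstract only sketches the Skipping Sub-Layer idea, so below is a minimal, illustrative PyTorch sketch of what randomly omitting sub-layers during training might look like. The class name `SkippingSubLayerEncoderLayer`, the drop probability `p_skip`, and the pre-norm layer internals are assumptions made for illustration, not the authors' released implementation (see the linked GPKD repository for that).

```python
# Hedged sketch of the Skipping Sub-Layer idea: during training, each
# sub-layer (self-attention or feed-forward) is randomly omitted with some
# probability, so the residual branch passes the input through unchanged.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class SkippingSubLayerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, p_skip=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_skip = p_skip  # assumed probability of omitting a sub-layer

    def _maybe_skip(self) -> bool:
        # Perturbation is applied only in training; at inference every sub-layer runs.
        return self.training and torch.rand(1).item() < self.p_skip

    def forward(self, x):
        # Self-attention sub-layer with residual connection, possibly skipped.
        if not self._maybe_skip():
            h = self.norm1(x)
            attn_out, _ = self.self_attn(h, h, h)
            x = x + attn_out
        # Feed-forward sub-layer with residual connection, possibly skipped.
        if not self._maybe_skip():
            x = x + self.ffn(self.norm2(x))
        return x


# Usage: sub-layers are dropped only in train mode; eval mode is deterministic.
layer = SkippingSubLayerEncoderLayer()
tokens = torch.randn(2, 10, 512)  # (batch, sequence, d_model)
layer.train()
print(layer(tokens).shape)  # torch.Size([2, 10, 512])
```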
Year | Venue | DocType
---|---|---
2021 | Thirty-Fifth AAAI Conference on Artificial Intelligence, Thirty-Third Conference on Innovative Applications of Artificial Intelligence and the Eleventh Symposium on Educational Advances in Artificial Intelligence | Conference

Volume | ISSN | Citations
---|---|---
35 | 2159-5399 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 7
Name | Order | Citations | PageRank |
---|---|---|---
Bei Li | 1 | 1 | 3.06 |
Ziyang Wang | 2 | 0 | 0.68 |
Y. H. Liu | 3 | 40 | 16.40 |
Quan Du | 4 | 0 | 1.01 |
Tong Xiao | 5 | 131 | 23.91 |
Chunliang Zhang | 6 | 50 | 8.30 |
Jingbo Zhu | 7 | 703 | 64.21 |