Abstract |
---|
Knowledge distillation is a method for improving a student network by training it to learn from a stronger teacher network. In this work, we experiment with different kinds of teacher networks to enhance the translation performance of a student Neural Machine Translation (NMT) network. We demonstrate techniques based on an ensemble teacher and a best-BLEU teacher network. We also show how to benefit from a teacher network that has the same architecture and dimensions as the student network. Furthermore, we introduce a data filtering technique based on the dissimilarity between the forward translation (obtained during knowledge distillation) of a given source sentence and its target reference, using TER to measure dissimilarity. Finally, we show that an ensemble teacher model can significantly reduce the student model size while still yielding performance improvements over the baseline student network. |
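The data filtering idea in the abstract can be sketched in a few lines: score each forward translation against its reference and drop sentence pairs whose dissimilarity exceeds a threshold. A minimal sketch follows; the true TER metric also allows block shifts, so here a word-level edit distance normalized by reference length is used as a simplified stand-in, and the function names and the threshold value are illustrative assumptions, not the paper's implementation.

```python
def edit_distance(hyp, ref):
    # Standard Levenshtein distance over word tokens (one-row DP).
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n]

def ter_proxy(hypothesis, reference):
    # Simplified TER: word edit distance / reference length (no shifts).
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return float(len(hyp) > 0)
    return edit_distance(hyp, ref) / len(ref)

def filter_corpus(pairs, forward_translations, threshold=0.8):
    # Keep (source, target) pairs whose forward translation is
    # sufficiently similar to the target reference.
    kept = []
    for (src, tgt), fwd in zip(pairs, forward_translations):
        if ter_proxy(fwd, tgt) <= threshold:
            kept.append((src, tgt))
    return kept
```

The surviving pairs would then be used (with the teacher's forward translations) to train the student, discarding sentences the teacher translates very differently from the reference.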
Year | Venue | Field
---|---|---
2017 | arXiv: Computation and Language | Data filtering, Computer science, Machine translation, Oracle, Artificial intelligence, Natural language processing, Speedup, Architecture, Speech recognition, Distillation, Decoding methods, Sentence, Machine learning

DocType | Volume | Citations
---|---|---
Journal | abs/1702.01802 | 3

PageRank | References | Authors
---|---|---
0.40 | 5 | 3
Name | Order | Citations | PageRank |
---|---|---|---
Markus Freitag | 1 | 86 | 15.28 |
Yaser Al-Onaizan | 2 | 540 | 38.51 |
Baskaran Sankaran | 3 | 155 | 13.65 |