Title
Probability Distillation: A Caveat and Alternatives
Abstract
Due to Van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based "evaluative" training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an "instructive" signal, and show that it is possible to achieve qualitatively better results than with KL minimization.
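For illustration, the reverse-KL objective the abstract refers to is typically estimated stochastically by sampling from the student q_theta and penalizing the density mismatch against the teacher p, i.e. E_{x~q_theta}[log q_theta(x) - log p(x)]. The sketch below is a minimal, hypothetical PyTorch example of this kind of stochastic reverse-KL minimization on toy one-dimensional Gaussians; it is not the paper's setup (the paper concerns autoregressive teachers such as WaveNet), and all names, distributions, and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch (assumed names, toy 1-D densities, not the paper's setup):
# stochastic minimization of the reverse KL, E_{x~q_theta}[log q_theta(x) - log p(x)],
# between a learnable "student" and a fixed "teacher" distribution.
import torch
from torch import distributions as D

teacher = D.Normal(loc=torch.tensor(2.0), scale=torch.tensor(0.5))  # stand-in teacher
mu = torch.zeros(1, requires_grad=True)         # student parameters
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    student = D.Normal(loc=mu, scale=log_sigma.exp())
    # Reparameterized samples so gradients flow through the sampling step.
    x = student.rsample((64,))
    # Monte Carlo estimate of the reverse KL divergence.
    loss = (student.log_prob(x) - teacher.log_prob(x)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The student parameters should move toward the teacher's (2.0, 0.5).
print(mu.item(), log_sigma.exp().item())
```

In this simple unimodal case the student matches the teacher easily; the abstract's point is that the same "evaluative" signal, which only scores samples the student itself produces, explores poorly when the target distribution is high-dimensional and highly structured.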
Year: 2019
Venue: UAI
Field: Mathematical optimization, Computer science, Distillation
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name                 Order   Citations   PageRank
Chin-Wei Huang       1       8           5.18
Faruk Ahmed          2       0           0.34
Kundan Kumar         3       10          5.89
Alexandre Lacoste    4       147         13.05
Aaron C. Courville   5       6671        348.46