Bridging Adversarial Robustness and Gradient Interpretability.

Paper Info

Title
Bridging Adversarial Robustness and Gradient Interpretability.

Abstract
Adversarial training is a training scheme designed to counter adversarial attacks by augmenting the training dataset with adversarial examples. Surprisingly, several studies have observed that loss gradients from adversarially trained DNNs are visually more interpretable than those from standard DNNs. Although this phenomenon is interesting, only a few works have offered an explanation. In this paper, we attempted to bridge the gap between adversarial robustness and gradient interpretability. To this end, we identified that loss gradients from adversarially trained DNNs align better with human perception because adversarial training restricts gradients to lie closer to the image manifold. We then demonstrated that adversarial training causes loss gradients to be quantitatively meaningful. Finally, we showed that under the adversarial training framework there exists an empirical trade-off between test accuracy and loss gradient interpretability, and we proposed two potential approaches to resolving this trade-off.
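The two mechanics the abstract relies on, the loss gradient with respect to the input and adversarial training as augmentation of the training set with adversarial examples, can be illustrated on a toy problem. This is a generic sketch, not the authors' setup: it assumes a binary logistic-regression model (instead of a DNN) on synthetic 2-D Gaussian data, uses the one-step fast gradient sign method (FGSM) as the adversary, and the names `input_gradient`, `fgsm`, and `adversarial_train` are hypothetical.

```python
import numpy as np

# Hypothetical toy setup (not the paper's): linear logistic regression,
# labels y in {-1, +1}, FGSM as the adversary, plain gradient descent.


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, x, y):
    """Mean logistic loss log(1 + exp(-y * (x @ w + b)))."""
    margins = y * (x @ w + b)
    return float(np.mean(np.logaddexp(0.0, -margins)))


def input_gradient(w, b, x, y):
    """Per-example gradient of the logistic loss with respect to the input x."""
    margins = y * (x @ w + b)
    # d/dx log(1 + exp(-m)) = -sigmoid(-m) * dm/dx, and dm/dx = y * w
    coeff = -y * sigmoid(-margins)        # shape (n,)
    return coeff[:, None] * w[None, :]    # shape (n, d)


def fgsm(w, b, x, y, eps):
    """One-step L-infinity attack: move each input eps along the sign of its loss gradient."""
    return x + eps * np.sign(input_gradient(w, b, x, y))


def adversarial_train(x, y, eps, lr=0.1, steps=500):
    """Gradient descent on the clean batch augmented with FGSM examples of the current model."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        x_aug = np.concatenate([x, fgsm(w, b, x, y, eps)])
        y_aug = np.concatenate([y, y])
        # Gradient of the loss w.r.t. the parameters: coeff * x for w, coeff for b
        coeff = -y_aug * sigmoid(-y_aug * (x_aug @ w + b))
        w -= lr * np.mean(coeff[:, None] * x_aug, axis=0)
        b -= lr * float(np.mean(coeff))
    return w, b


# Usage: two Gaussian blobs, adversarially trained with eps = 0.3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
w, b = adversarial_train(x, y, eps=0.3)
x_adv = fgsm(w, b, x, y, eps=0.3)
# For a linear model every margin drops by eps * ||w||_1, so the FGSM loss is larger
print(loss(w, b, x, y), loss(w, b, x_adv, y))
```

Because the model is linear, its input gradient points along the same vector `w` for every example (up to sign and scale), so this toy cannot exhibit the image-manifold alignment the abstract describes; it only shows how the loss gradient drives both the attack and the augmented training loop.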

Year	2019
Venue	arXiv: Learning
DocType	Journal
Volume	abs/1903.11626
Citations	0
PageRank	0.34
References	0
Authors	3

Authors (3)

Name	Order	Citations	PageRank
Beomsu Kim	1	3	2.79
Junghoon Seo	2	0	1.01
Taegyun Jeon	3	4	3.46

Cited by (0)

References (0)
