Title
Evaluating Models' Local Decision Boundaries via Contrast Sets
Abstract
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
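The evaluation the abstract describes pairs each original test instance with small manual perturbations that (typically) change the gold label, then checks the model on the whole bundle. A minimal sketch of that scoring loop, using hypothetical data structures (the released benchmarks use task-specific formats), might look like:

```python
def contrast_metrics(model, contrast_sets):
    """Score a model on contrast sets.

    contrast_sets: list of lists of (input, gold_label) pairs; the first
    pair in each inner list is the original test instance, the rest are
    its manual perturbations.

    Returns (perturbed-instance accuracy, consistency), where consistency
    is the fraction of sets on which the model answers *every* instance,
    original and perturbed, correctly.
    """
    n_perturbed = n_perturbed_correct = n_consistent = 0
    for cset in contrast_sets:
        all_correct = True
        for i, (x, gold) in enumerate(cset):
            correct = model(x) == gold
            all_correct &= correct
            if i > 0:  # only perturbed instances count toward this accuracy
                n_perturbed += 1
                n_perturbed_correct += correct
        n_consistent += all_correct
    return (n_perturbed_correct / n_perturbed,
            n_consistent / len(contrast_sets))

# Toy sentiment example: a brittle "model" that ignores negation and so
# fails on every perturbation that flips the label.
predictions = {"good": 1, "not good": 1, "bad": 0, "not bad": 0}
sets = [
    [("good", 1), ("not good", 0)],
    [("bad", 0), ("not bad", 1)],
]
acc, consistency = contrast_metrics(lambda x: predictions[x], sets)
# Both metrics are 0.0 here: the model gets every original right but
# every perturbation wrong, so no set is answered consistently.
```

The gap between ordinary test accuracy (perfect on the two originals above) and contrast consistency (zero) is exactly the failure mode the abstract attributes to systematic gaps in standard test sets.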
Year: 2020
DOI: 10.18653/V1/2020.FINDINGS-EMNLP.117
Venue: EMNLP
DocType: Conference
Volume: 2020.findings-emnlp
Citations: 0
PageRank: 0.34
References: 0
Authors: 26
Name                  Order  Citations  PageRank
Matthew Gardner       1      704        38.49
Yoav Artzi            2      483        26.99
Victoria Basmova      3      0          0.34
Jonathan Berant       4      982        53.86
Ben Bogin             5      20         4.06
Sihao Chen            6      0          0.34
Pradeep Dasigi        7      131        12.09
Dheeru Dua            8      38         4.95
Yanai Elazar          9      9          5.54
Ananth Gottumukkala   10     0          0.34
Nitish Gupta          11     17         4.70
Hannaneh Hajishirzi   12     417        46.10
Gabriel Ilharco       13     5          2.11
Daniel Khashabi       14     114        15.14
Kevin Lin             15     1          1.36
Jiangming Liu         16     19         6.12
Nelson Liu            17     42         4.59
Phoebe Mulcaire       18     3          1.40
Qiang Ning            19     18         9.48
Sameer Singh          20     1060       71.63
Noah A. Smith         21     5867       314.27
Sanjay Subramanian    22     1          3.78
Reut Tsarfaty         23     0          0.34
Eric Wallace          24     18         7.45
Ally Zhang            25     0          0.34
Ben Zhou              26     0          0.68