Abstract
---
Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge distillation and few-shot learning.
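The abstract describes the core of SentAugment: embed the task's labeled sentences, average them into a task-specific query embedding, and retrieve nearest neighbors from a large bank of unlabeled sentences. The sketch below illustrates only that retrieval step under simplifying assumptions; the `embed` function is a toy hashed bag-of-words stand-in for the paper's trained sentence encoder, and all function names here are illustrative, not the authors' code.

```python
import numpy as np

def embed(sentences, dim=256):
    # Toy stand-in for a real sentence encoder: hashed bag-of-words,
    # L2-normalized. SentAugment itself uses a trained sentence
    # embedding model over a web-scale sentence bank.
    vecs = np.zeros((len(sentences), dim), dtype=np.float32)
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def retrieve_for_task(labeled_sentences, sentence_bank, k=3):
    """Build a task-specific query embedding (mean of the labeled-sentence
    embeddings) and return the k nearest bank sentences by cosine similarity."""
    query = embed(labeled_sentences).mean(axis=0)
    query /= np.linalg.norm(query)
    bank_vecs = embed(sentence_bank)   # in practice: a precomputed index
    scores = bank_vecs @ query         # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [sentence_bank[i] for i in top]

labeled = ["the movie was wonderful", "a dreadful, boring film"]
bank = ["great acting and a moving plot",
        "stock prices fell sharply today",
        "i hated every minute of this film",
        "the recipe calls for two eggs"]
print(retrieve_for_task(labeled, bank, k=2))
```

In the full pipeline, a teacher model fine-tuned on the labeled data would pseudo-label the retrieved sentences, and a student model would then be trained on those synthetic examples.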
Year | Venue | DocType
---|---|---
2021 | NAACL-HLT | Conference

Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (8)
Name | Order | Citations | PageRank |
---|---|---|---
Jingfei Du | 1 | 19 | 4.47
Edouard Grave | 2 | 860 | 33.43
Beliz Gunel | 3 | 0 | 0.68 |
Vishrav Chaudhary | 4 | 8 | 8.26 |
Onur Celebi | 5 | 0 | 0.34 |
Michael Auli | 6 | 1061 | 53.54 |
Veselin Stoyanov | 7 | 769 | 38.32 |
Alexis Conneau | 8 | 342 | 15.03 |