Abstract
---
Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge distillation and few-shot learning.
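The abstract describes the core of SentAugment: embed the task's labeled sentences, average them into a task-specific query embedding, and retrieve nearest neighbors from a large bank of unlabeled sentences. The sketch below illustrates only that retrieval step under simplifying assumptions; the `embed` function is a toy hashed bag-of-words stand-in for the paper's trained sentence encoder, and all function names here are illustrative, not the authors' code.

```python
import numpy as np

def embed(sentences, dim=256):
    # Toy stand-in for a real sentence encoder: hashed bag-of-words,
    # L2-normalized. SentAugment itself uses a trained sentence
    # embedding model over a web-scale sentence bank.
    vecs = np.zeros((len(sentences), dim), dtype=np.float32)
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def retrieve_for_task(labeled_sentences, sentence_bank, k=3):
    """Build a task-specific query embedding (mean of the labeled-sentence
    embeddings) and return the k nearest bank sentences by cosine similarity."""
    query = embed(labeled_sentences).mean(axis=0)
    query /= np.linalg.norm(query)
    bank_vecs = embed(sentence_bank)   # in practice: a precomputed index
    scores = bank_vecs @ query         # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [sentence_bank[i] for i in top]

labeled = ["the movie was wonderful", "a dreadful, boring film"]
bank = ["great acting and a moving plot",
        "stock prices fell sharply today",
        "i hated every minute of this film",
        "the recipe calls for two eggs"]
print(retrieve_for_task(labeled, bank, k=2))
```

In the full pipeline, a teacher model fine-tuned on the labeled data would pseudo-label the retrieved sentences, and a student model would then be trained on those synthetic examples.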
Year | Venue | DocType
---|---|---
2021 | NAACL-HLT | Conference

Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (8)
Name | Order | Citations | PageRank |
---|---|---|---
Jingfei Du | 1 | 19 | 4.47
Edouard Grave | 2 | 860 | 33.43
Beliz Gunel | 3 | 0 | 0.68 |
Vishrav Chaudhary | 4 | 8 | 8.26 |
Onur Celebi | 5 | 0 | 0.34 |
Michael Auli | 6 | 1061 | 53.54 |
Veselin Stoyanov | 7 | 769 | 38.32 |
Alexis Conneau | 8 | 342 | 15.03 |