Title
Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Abstract
Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
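As an illustration of the similarity-based nomination of pretraining data that the abstract describes, the sketch below ranks candidate corpora by Jensen-Shannon divergence between their unigram distributions and a target corpus (lower divergence = more in-domain). This is a minimal sketch under assumed details: the paper's actual similarity measures, corpora, and tokenization may differ, and all corpus names and texts here are hypothetical.

```python
# Hedged illustration only: one plausible similarity measure (unigram
# Jensen-Shannon divergence) for nominating in-domain pretraining data.
# Not necessarily the measure used in the paper; data is made up.
from collections import Counter
import math


def unigram_dist(texts):
    """Unigram probability distribution over whitespace tokens."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_m(a):
        # KL(a || m); terms where a(w) = 0 contribute nothing.
        return sum(a.get(w, 0.0) * math.log2(a.get(w, 0.0) / m[w])
                   for w in vocab if a.get(w, 0.0) > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)


# Hypothetical target task corpus and candidate pretraining corpora.
target = ["new tweet about my day lol", "cant wait 4 the weekend"]
candidates = {
    "tweets": ["omg this is so cool lol", "gm everyone have a gr8 day"],
    "forums": ["Has anyone tried this recipe before?",
               "I agree with the previous poster."],
}

p = unigram_dist(target)
scores = {name: js_divergence(p, unigram_dist(texts))
          for name, texts in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: JSD = {score:.3f}")
```

Under these assumptions, the candidate corpus with the lowest divergence from the target would be nominated as in-domain pretraining data.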
Year
2020
DOI
10.18653/V1/2020.FINDINGS-EMNLP.151
Venue
EMNLP
DocType
Conference
Volume
2020.findings-emnlp
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Xiang Dai       1      1          3.05
Sarvnaz Karimi  2      380        33.01
Ben Hachey      3      321        24.83
Cécile Paris    4      1700       243.43