Title: Using Similarity Measures to Select Pretraining Data for NER
Abstract: Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.
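The abstract does not spell out the three measures, but as an illustration of the kind of cheap, corpus-level similarity signal it describes, the sketch below computes the fraction of the target task's vocabulary that also appears in the pretraining corpus. This is a minimal sketch of one plausible measure, not the authors' exact method; the function names and file paths are hypothetical.

```python
from collections import Counter

def vocabulary(path, min_count=1):
    """Return the set of whitespace tokens occurring at least min_count times."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return {tok for tok, c in counts.items() if c >= min_count}

def target_vocab_covered(source_path, target_path):
    """Fraction of the target vocabulary covered by the source pretraining data."""
    source_vocab = vocabulary(source_path)
    target_vocab = vocabulary(target_path)
    return len(target_vocab & source_vocab) / len(target_vocab)

if __name__ == "__main__":
    # Hypothetical corpora: raw pretraining text vs. target NER training text.
    score = target_vocab_covered("pretraining_corpus.txt", "ner_train.txt")
    print(f"Target vocabulary covered: {score:.2%}")
```

A measure like this needs only a single pass over each corpus, which matches the "cost-effective" framing in the abstract: it can rank candidate pretraining corpora without training anything.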
Year: 2019
Venue: arXiv: Computation and Language
Field: Computer science, Artificial intelligence, Natural language processing, Machine learning
DocType: Journal
Volume: abs/1904.00585
Citations: 1
PageRank: 0.34
References: 0
Authors: 4
Name            Order  Citations  PageRank
Xiang Dai       1      1          3.05
Sarvnaz Karimi  2      380        33.01
Ben Hachey      3      321        24.83
Cécile Paris    4      1          0.68