Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models. - Citegraph

Paper Info

Title
Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models.

Abstract
Rare word representation has recently enjoyed a surge of interest, owing to the crucial role that effective handling of infrequent words can play in accurate semantic understanding. However, there is a paucity of reliable benchmarks for evaluation and comparison of these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. In order to fill this evaluation gap, we propose CAmbridge Rare word Dataset (Card-660), an expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques. Through a set of experiments we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve performances higher than 0.43 (Pearson correlation) on the dataset, compared to a human-level upperbound of 0.90. We release the dataset and the annotation materials at https://pilehvar.github.io/card-660/.
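The evaluation the abstract describes, scoring word embeddings by the Pearson correlation between model similarities and human judgements, can be sketched as below. This is a minimal illustration and not the authors' evaluation script. The function names, the `oov_score` fallback for out-of-vocabulary words, and the use of cosine similarity are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def evaluate(pairs, embeddings, oov_score=0.0):
    """Correlate model similarities with gold scores.

    pairs: iterable of (word1, word2, gold_score) tuples.
    embeddings: dict mapping a word to its vector.
    oov_score: similarity assigned when either word has no vector
               (an assumption of this sketch, not taken from the paper).
    """
    predicted, gold = [], []
    for word1, word2, score in pairs:
        if word1 in embeddings and word2 in embeddings:
            predicted.append(cosine(embeddings[word1], embeddings[word2]))
        else:
            predicted.append(oov_score)
        gold.append(score)
    return pearson(predicted, gold)
```

Calling `evaluate(pairs, embeddings)` with the dataset's word pairs and a loaded embedding table would return a single correlation value comparable to the figures quoted in the abstract, subject to how out-of-vocabulary words are handled.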

Year	2018
DOI	10.18653/v1/d18-1169
Venue	Empirical Methods in Natural Language Processing
Field	Pearson product-moment correlation coefficient, Word representation, Annotation, Computer science, Natural language processing, Artificial intelligence, Vocabulary
DocType	Journal
Volume	abs/1808.09308
Citations	0
PageRank	0.34
References	23
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Mohammad Taher Pilehvar	1	376	25.70
Dimitri Kartsaklis	2	204	15.08
Victor Prokhorov	3	2	2.38
Nigel Collier	4	18	5.07
