Abstract |
---|
We present AnonData, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M-token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We tested the performance of two NER models on our dataset: a baseline XLM-RoBERTa model, and GEMNET, a state-of-the-art NER model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET improves significantly on the baseline (average macro-F1 improvement of +30%), which demonstrates the difficulty of our dataset. AnonData poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. |
Year | Venue | DocType |
---|---|---|
2022 | International Conference on Computational Linguistics | Conference |
Volume | Citations | PageRank |
---|---|---|
Proceedings of the 29th International Conference on Computational Linguistics | 0 | 0.34 |
References | Authors |
---|---|
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shervin Malmasi | 1 | 0 | 1.01 |
Anjie Fang | 2 | 0 | 0.68 |
Besnik Fetahu | 3 | 1 | 1.04 |
Sudipta Kar | 4 | 5 | 4.14 |
Oleg Rokhlenko | 5 | 1 | 1.36 |