MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition. - Citegraph

Paper Info

Title
MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition.

Abstract
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We tested the performance of two NER models on our dataset: a baseline XLM-RoBERTa model, and GEMNET, a state-of-the-art NER model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET, which uses gazetteers, improves performance significantly (average improvement of macro-F1=+30%) and demonstrates the difficulty of our dataset. MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems.

Year	Venue	DocType
2022	International Conference on Computational Linguistics	Conference
Volume	Citations	PageRank
Proceedings of the 29th International Conference on Computational Linguistics	0	0.34
References	Authors
0	5

Authors (5 rows)

Name	Order	Citations	PageRank
Shervin Malmasi	1	0	1.01
Anjie Fang	2	0	0.68
Besnik Fetahu	3	1	1.04
Sudipta Kar	4	5	4.14
Oleg Rokhlenko	5	1	1.36
