Title
The CHEMDNER corpus of chemicals and drugs and its annotation principles
Abstract
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
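As an illustration of the agreement study mentioned in the abstract, the sketch below represents a mention with its SACEM class and computes a percentage agreement between two annotators. The abstract does not say how agreement was computed, so the definition used here (exactly matching spans divided by all spans marked by either annotator) is an assumption, and the `Mention` type, its field names, and the example offsets are hypothetical.

```python
from dataclasses import dataclass

# The seven SACEM classes named in the abstract.
SACEM_CLASSES = frozenset({
    "abbreviation", "family", "formula", "identifier",
    "multiple", "systematic", "trivial",
})


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one chemical mention (not the corpus file format)."""
    doc_id: str   # document identifier, e.g. a PubMed ID
    start: int    # character offset where the mention begins
    end: int      # character offset where the mention ends (exclusive)
    sacem: str    # one of SACEM_CLASSES

    def __post_init__(self):
        if self.sacem not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {self.sacem!r}")
        if not 0 <= self.start < self.end:
            raise ValueError("offsets must satisfy 0 <= start < end")


def percentage_agreement(annotator_a, annotator_b):
    """Assumed definition: exact-span matches over the union of all spans, in percent."""
    spans_a = {(m.doc_id, m.start, m.end) for m in annotator_a}
    spans_b = {(m.doc_id, m.start, m.end) for m in annotator_b}
    union = spans_a | spans_b
    if not union:
        return 100.0  # neither annotator marked anything, so they trivially agree
    return 100.0 * len(spans_a & spans_b) / len(union)


# Made-up annotations for a single made-up document.
annotator_1 = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "formula"),
    Mention("doc1", 40, 52, "systematic"),
]
annotator_2 = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "formula"),
    Mention("doc1", 60, 63, "abbreviation"),
]

agreement = percentage_agreement(annotator_1, annotator_2)
```

Here two of the four distinct spans are shared, so `agreement` is 50.0. A stricter variant could also require the SACEM class to match by adding `m.sacem` to the compared tuples.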
Year: 2015
DOI: 10.1186/1758-2946-7-S1-S2
Venue: Journal of Cheminformatics
Keywords: named entity recognition, BioCreative, text mining, chemical entity recognition, machine learning, chemical indexing, ChemNLP
Field: Data mining, Annotation, Information retrieval, Identifier, Computer science, Text corpus, Bioinformatics, Named-entity recognition
DocType: Journal
Volume: 7
Issue: S1
ISSN: 1758-2946
Citations: 72
PageRank: 2.17
References: 36
Authors: 53
Name | Order | Citations | PageRank
Martin Krallinger | 1 | 763 | 35.65
Obdulia Rabal | 2 | 162 | 8.61
Florian Leitner | 3 | 362 | 14.92
Miguel Vazquez | 4 | 228 | 8.54
David Salgado | 5 | 246 | 9.78
Zhiyong Lu | 6 | 2735 | 171.27
Robert Leaman | 7 | 914 | 39.98
Yanan Lu | 8 | 97 | 4.02
Donghong Ji | 9 | 892 | 120.08
Daniel M Lowe | 10 | 72 | 2.17
Roger A Sayle | 11 | 72 | 2.50
Riza Theresa Batista-Navarro | 12 | 72 | 2.17
Rafal Rak | 13 | 382 | 18.30
Torsten Huber | 14 | 72 | 2.17
Tim Rocktäschel | 15 | 72 | 2.17
Sérgio Matos | 16 | 415 | 29.51
David Campos | 17 | 72 | 2.17
Buzhou Tang | 18 | 368 | 34.04
Hua Xu | 19 | 77 | 3.27
Tsendsuren Munkhdalai | 20 | 169 | 13.49
Keun Ho Ryu | 21 | 81 | 4.39
S V Ramanan | 22 | 72 | 2.17
Senthil Nathan | 23 | 122 | 5.37
Slavko Zitnik | 24 | 94 | 6.68
Marko Bajec | 25 | 465 | 34.56
Lutz Weber | 26 | 76 | 2.60
Matthias Irmer | 27 | 72 | 2.17
Saber A. Akhondi | 28 | 124 | 9.40
Jan A Kors | 29 | 72 | 2.17
Shuo Xu | 30 | 72 | 2.17
Xin An | 31 | 72 | 2.50
Utpal Kumar Sikdar | 32 | 72 | 2.17
Asif Ekbal | 33 | 737 | 119.31
Masaharu Yoshioka | 34 | 368 | 41.40
Thaer M Dieb | 35 | 72 | 2.17
Miji Choi | 36 | 77 | 3.39
Karin Verspoor | 37 | 993 | 78.54
Madian Khabsa | 38 | 237 | 18.81
C. Lee Giles | 39 | 11154 | 1549.48
Hongfang Liu | 40 | 1479 | 160.66
Komandur Elayavilli Ravikumar | 41 | 72 | 2.17
Andre Lamurias | 42 | 83 | 9.32
Francisco M. Couto | 43 | 982 | 72.63
Hong-Jie Dai | 44 | 288 | 21.58
Richard Tzong-Han Tsai | 45 | 72 | 2.84
Caglar Ata | 46 | 72 | 2.17
Tolga Can | 47 | 268 | 16.39
Anabel Usié | 48 | 100 | 3.88
Rui Alves | 49 | 196 | 32.99
Isabel Segura-Bedmar | 50 | 435 | 30.96