Title
BioCreative V CDR task corpus: a resource for chemical disease relation extraction.
Abstract
Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus, called BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for diseases and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
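The abstract reports inter-annotator agreement as a Jaccard similarity coefficient, i.e. the size of the intersection of two annotators' annotation sets divided by the size of their union. The following is a minimal sketch of that calculation, not the authors' actual scoring script. It assumes each annotation is a (start offset, end offset, concept identifier) tuple and that two annotations agree only when all three fields match exactly; the abstract does not state the matching criteria used. The function name `jaccard` and the sample annotations are hypothetical.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections, or 1.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty annotation sets are treated as full agreement (an assumption).
        return 1.0
    return len(a & b) / len(union)


# Hypothetical annotations for one abstract:
# (start offset, end offset, placeholder concept identifier)
annotator_1 = {(0, 10, "ID_A"), (25, 37, "ID_B"), (50, 58, "ID_C")}
annotator_2 = {(0, 10, "ID_A"), (25, 37, "ID_B"), (60, 72, "ID_D")}

score = jaccard(annotator_1, annotator_2)
print(f"Jaccard agreement: {score:.2%}")
```

Here the annotators share two annotations out of four distinct ones, so the agreement is 50%. A corpus-level figure such as those in the abstract would presumably average a score like this over articles, though the exact aggregation is not described here.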
Year: 2016
DOI: 10.1093/database/baw068
Venue: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Field: Data mining, Annotation, Information retrieval, Identifier, Computer science, Text corpus, Controlled vocabulary, Jaccard index, Bioinformatics, Named-entity recognition, Relationship extraction, Test set
DocType: Journal
Volume: 2016
ISSN: 1758-0463
Citations: 36
PageRank: 1.39
References: 23
Authors: 10
Name                  Order  Citations  PageRank
Jiao Li               1      36         1.39
Yueping Sun           2      36         1.39
Robin J. Johnson      3      36         1.39
Daniela Sciaky        4      36         1.39
Chih-Hsuan Wei        5      546        27.43
Robert Leaman         6      914        39.98
Allan Peter Davis     7      444        24.76
Carolyn J. Mattingly  8      495        29.93
Thomas C. Wiegers     9      603        30.77
Zhiyong Lu            10     2735       171.27