Title
Mining chemical patents with an ensemble of open systems.
Abstract
The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach.
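The abstract says the component systems' outputs are combined by a machine learning classifier but gives no details. The following is a minimal sketch of that general idea, not the authors' implementation: each candidate span becomes a 0/1 vector of which systems proposed it, and a classifier decides whether to keep it. The system names, the spans, the training labels, and the choice of a perceptron as the classifier are all assumptions made for illustration.

```python
# Toy sketch: combine span predictions from several NER systems with a
# learned classifier. All names and data below are invented for illustration.

SYSTEMS = ["sys_a", "sys_b", "sys_c"]  # hypothetical component systems


def vote_features(outputs):
    """Map every span proposed by any system to a 0/1 vote vector."""
    candidates = set()
    for name in SYSTEMS:
        candidates |= outputs[name]
    return {
        span: [1 if span in outputs[name] else 0 for name in SYSTEMS]
        for span in sorted(candidates)
    }


def train_perceptron(rows, labels, max_epochs=1000):
    """Perceptron stand-in for the classifier; stops after a mistake-free epoch."""
    weights, bias = [0] * len(rows[0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            score = bias + sum(w * v for w, v in zip(weights, x))
            predicted = 1 if score > 0 else 0
            if predicted != y:
                step = y - predicted  # +1 for a missed span, -1 for a false alarm
                weights = [w + step * v for w, v in zip(weights, x)]
                bias += step
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def keep_span(weights, bias, votes):
    """Accept a candidate span when its weighted vote score is positive."""
    return bias + sum(w * v for w, v in zip(weights, votes)) > 0


# Toy training data: vote vectors with gold labels (1 = span is in the gold set).
# In this invented setup sys_a is reliable on its own, while sys_b and sys_c
# are only right when they agree with another system.
TRAIN_ROWS = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
              [1, 0, 1], [0, 1, 1], [1, 1, 1]]
TRAIN_LABELS = [1, 0, 0, 1, 1, 1, 1]

weights, bias = train_perceptron(TRAIN_ROWS, TRAIN_LABELS)

# Outputs of each system on one unseen sentence, as (start, end) offsets.
outputs = {
    "sys_a": {(0, 7), (20, 29)},
    "sys_b": {(20, 29), (35, 41)},
    "sys_c": {(35, 41), (50, 54)},
}
features = vote_features(outputs)
accepted = [span for span, votes in features.items()
            if keep_span(weights, bias, votes)]
print(accepted)  # [(0, 7), (20, 29), (35, 41)]
```

In this toy setup the learned combination keeps the span proposed only by `sys_a`, which a simple majority vote would discard. That is one reason a trained combiner can outperform voting when the component systems differ in reliability.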
Year: 2016
DOI: 10.1093/database/baw065
Venue: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Field: Data mining, Text mining, Computer science, Bioinformatics, Statistical classification, Open system (systems theory), Named-entity recognition
DocType: Journal
Volume: 2016
ISSN: 1758-0463
Citations: 4
PageRank: 0.46
References: 19
Authors: 4
Name              Order   Citations   PageRank
Robert Leaman     1       914         39.98
Chih-Hsuan Wei    2       546         27.43
Cherry Zou        3       4           0.46
Zhiyong Lu        4       2735        171.27