Title
A document processing pipeline for annotating chemical entities in scientific documents.
Abstract
The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.

We used the BioCreative IV CHEMDNER task data to train and evaluate a machine learning-based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task.

We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering of erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
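The post-processing stage mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `parentheses_balanced` and `postprocess`, the sample mentions, and the exclusion list are hypothetical. The sketch also simplifies parentheses correction to discarding mentions with unbalanced brackets, whereas the actual module may repair them instead, and it omits abbreviation resolution.

```python
def parentheses_balanced(text: str) -> bool:
    """Return True if round, square and curly brackets in text are properly nested."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in pairs.values():
            stack.append(ch)          # opening bracket: remember it
        elif ch in pairs:
            # closing bracket: must match the most recent opening bracket
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                  # leftover openers mean imbalance


def postprocess(mentions, exclusion_list):
    """Filter candidate mentions (assumed to come from the CRF models).

    Drops mentions with unbalanced brackets (a simplification of
    'parentheses correction') and mentions found in the exclusion list.
    """
    excluded = {term.lower() for term in exclusion_list}
    kept = []
    for mention in mentions:
        if not parentheses_balanced(mention):
            continue
        if mention.lower() in excluded:
            continue
        kept.append(mention)
    return kept


# Hypothetical candidate mentions and exclusion list, for illustration only.
candidates = ["2-(acetyloxy)benzoic acid", "poly(ethylene", "water", "ibuprofen"]
print(postprocess(candidates, exclusion_list=["Water"]))
```

In this sketch the exclusion list is matched case-insensitively; whether the original system does the same is an assumption.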
Year: 2015
DOI: 10.1186/1758-2946-7-S1-S7
Venue: J. Cheminformatics
Keywords: chemicals, conditional random fields, named entity recognition, biomedical research, bioinformatics
Field: Conditional random field, Data mining, Text mining, Information retrieval, Computer science, Document processing, Information extraction, Bioinformatics, Named-entity recognition
DocType: Journal
Volume: 7
Issue: Suppl 1 Text mining for chemistry and the CHEMDNER track
ISSN: 1758-2946
Citations: 6
PageRank: 0.45
References: 20
Authors: 3
Name | Order | Citations | PageRank
David Campos | 1 | 219 | 10.69
Sérgio Matos | 2 | 415 | 29.51
José Luis Oliveira | 3 | 760 | 84.03