A document processing pipeline for annotating chemical entities in scientific documents. - Citegraph

Paper Info

Title
A document processing pipeline for annotating chemical entities in scientific documents.

Abstract
The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text. We used the BioCreative IV CHEMDNER task data to train and evaluate a machine learning-based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task. We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
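
The post-processing stage mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `parentheses_balanced` and `postprocess` are hypothetical, the sample mentions and exclusion list are invented for illustration, and the sketch simply drops mentions with unbalanced brackets, whereas the paper describes a parentheses correction step whose details are not given here. Abbreviation resolution is omitted.

```python
def parentheses_balanced(text: str) -> bool:
    """Return True when every bracket in `text` is properly opened and closed."""
    closing_to_opening = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in closing_to_opening.values():
            stack.append(ch)
        elif ch in closing_to_opening:
            # A closing bracket must match the most recent opening bracket.
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack


def postprocess(mentions, exclusion_list):
    """Filter candidate mentions (assumed behaviour, not the paper's exact rules).

    A mention is dropped if it appears in the exclusion list
    (case-insensitive) or if its brackets are unbalanced.
    """
    excluded = {term.lower() for term in exclusion_list}
    kept = []
    for mention in mentions:
        if mention.lower() in excluded:
            continue
        if not parentheses_balanced(mention):
            continue
        kept.append(mention)
    return kept


# Hypothetical candidate mentions produced by an upstream recognizer.
candidates = ["2-(acetyloxy)benzoic acid", "aspirin)", "water"]
print(postprocess(candidates, exclusion_list=["water"]))
```

In this sketch the exclusion list is passed in directly; in the paper it is derived from the training data.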

Year	2015
DOI	10.1186/1758-2946-7-S1-S7
Venue	J. Cheminformatics
Keywords	chemicals, conditional random fields, named entity recognition, biomedical research, bioinformatics
Field	Conditional random field, Data mining, Text mining, Information retrieval, Computer science, Document processing, Information extraction, Bioinformatics, Named-entity recognition
DocType	Journal
Volume	7
Issue	Suppl 1 Text mining for chemistry and the CHEMDNER track
ISSN	1758-2946
Citations	6
PageRank	0.45
References	20
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
David Campos	1	219	10.69
Sérgio Matos	2	415	29.51
José Luis Oliveira	3	760	84.03
