Abstract | ||
---|---|---|
The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. LINNAEUS is an open-source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/. |
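
The dictionary-based matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the LINNAEUS implementation: it assumes a character trie (a simple deterministic automaton) with longest-match lookup, mostly ASCII input, and a tiny hand-written dictionary. The names `TrieNode`, `build_trie`, `find_mentions`, and `DICTIONARY` are hypothetical, and the heuristics for ambiguous mentions are not modeled.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One automaton state: outgoing transitions plus an optional taxonomy ID."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.taxon_id: Optional[int] = None  # set only on accepting states


def build_trie(dictionary: Dict[str, int]) -> TrieNode:
    """Compile a name -> NCBI taxonomy ID mapping into a character trie."""
    root = TrieNode()
    for name, taxon_id in dictionary.items():
        node = root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.taxon_id = taxon_id
    return root


def find_mentions(text: str, root: TrieNode) -> List[Tuple[int, int, str, int]]:
    """Return (start, end, matched_text, taxon_id) for the longest match at each position.

    Matches must begin and end on a word boundary. Assumes lowercasing does
    not change string length, which holds for ASCII text.
    """
    lowered = text.lower()
    mentions: List[Tuple[int, int, str, int]] = []
    i = 0
    while i < len(lowered):
        # Skip positions that fall in the middle of a word.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node = root
        best: Optional[Tuple[int, int]] = None
        j = i
        while j < len(lowered) and lowered[j] in node.children:
            node = node.children[lowered[j]]
            j += 1
            at_boundary = j == len(lowered) or not lowered[j].isalnum()
            if node.taxon_id is not None and at_boundary:
                best = (j, node.taxon_id)  # keep extending for a longer match
        if best is not None:
            end, taxon_id = best
            mentions.append((i, end, text[i:end], taxon_id))
            i = end
        else:
            i += 1
    return mentions


# Illustrative dictionary only; a real system would load the full NCBI taxonomy.
DICTIONARY = {
    "homo sapiens": 9606,
    "human": 9606,
    "humans": 9606,
    "drosophila melanogaster": 7227,
    "mus musculus": 10090,
}
ROOT = build_trie(DICTIONARY)

if __name__ == "__main__":
    for mention in find_mentions("Human and Drosophila melanogaster.", ROOT):
        print(mention)
```

Each lookup walks the trie one character at a time, so matching cost depends on text length and the longest dictionary entry rather than on the number of dictionary entries, which is the usual motivation for automaton-based dictionary matching.
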
Year | DOI | Venue |
---|---|---|
2010 | 10.1186/1471-2105-11-85 | BMC Bioinformatics |
Keywords | Field | DocType |
document retrieval,computational biology,microarrays,algorithms,bioinformatics,data mining,software systems,finite state automaton,text mining | Information retrieval,Identifier,Computer science,Automaton,Optical character recognition,Species name,Software system,Software,Heuristics,Document retrieval,Bioinformatics | Journal |
Volume | Issue | ISSN |
11 | 1 | 1471-2105 |
Citations | PageRank | References |
99 | 3.65 | 32 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Martin Gerner | 1 | 232 | 9.98 |
Goran Nenadic | 2 | 228 | 13.18 |
Casey M Bergman | 3 | 432 | 33.52 |