Title | ||
---|---|---|
OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. |
Abstract | ||
---|---|---|
Motivation: Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and, if possible, link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation. Results: We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, a recall of 97% and a grounding accuracy of 97.4%. |
Year | DOI | Venue |
---|---|---|
2011 | 10.1093/bioinformatics/btr452 | BIOINFORMATICS |
Field | DocType | Volume |
---|---|---|
Data mining,Ontology,Information retrieval,End user,Computer science,Linked data,Bioinformatics,Documentation,Unified Medical Language System,Semantics,Tag system,Organism | Journal | 27 |
Issue | ISSN | Citations |
---|---|---|
19 | 1367-4803 | 5 |
PageRank | References | Authors |
---|---|---|
0.45 | 11 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Nona Naderi | 1 | 20 | 6.27 |
Thomas Kappler | 2 | 38 | 3.80 |
Christopher J. O. Baker | 3 | 329 | 30.96 |
René Witte | 4 | 172 | 16.93 |