Title
Inter-species normalization of gene mentions with GNAT.
Abstract
Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words.
We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes.
A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/
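The reported inter-species figures can be checked with a short calculation. The sketch below assumes that the F-measure in the abstract is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is hypothetical and not part of GNAT.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the 13-species benchmark.
f1 = f_measure(0.908, 0.738)
print(f"{f1:.1%}")  # prints 81.4%
```

Under that assumption, 90.8% precision and 73.8% recall reproduce the stated 81.4%.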
Year
2008
DOI
10.1093/bioinformatics/btn299
Venue
ECCB
Keywords
subsequent normalization, gene mention normalization, inter-species normalization, detailed information, external data, available system, normalization facilitates indexing, test data, information retrieval perspective, biocreative metaservice project, supplementary information
Field
Data mining, Normalization (statistics), Information retrieval, Identifier, Computer science, Gnat, Search engine indexing, Bioinformatics, Ambiguity, Gene nomenclature
DocType
Conference
Volume
24
Issue
16
ISSN
1367-4811
Citations
60
PageRank
2.55
References
13
Authors (5)
Name               Order  Citations  PageRank
Jörg Hakenberg     1      472        23.88
Conrad Plake       2      272        13.22
Robert Leaman      3      914        39.98
Michael Schroeder  4      480        26.58
Graciela Gonzalez  5      624        39.60