Abstract | ||
---|---|---|
The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (+/-3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon. |
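
As a quick sanity check on the figures quoted in the abstract, the short Python sketch below recomputes the share of names removed by the false positive filter and extrapolates the sample percentages to the full lexicon. The per-category counts are an assumption made for illustration: the abstract reports only percentages from a random sample, and the 82% figure itself carries a margin of +/-3%. All variable names are hypothetical.

```python
# Figures quoted in the abstract.
selected = 1_240_462   # names selected by the ILP-learned criteria
remaining = 1_145_913  # names left after the false positive filter

# Share of the selected set removed by the filter.
removed = selected - remaining
removed_pct = 100 * removed / selected

# Sample-based composition estimates from the abstract (percent).
composition = {
    "complete and accurate gene/protein names": 82,
    "names related to genes/proteins": 12,
    "names unrelated to genes/proteins": 6,
}

# Extrapolated counts. This is an assumption: the abstract gives
# percentages from a random sample, not counts for the whole lexicon.
estimated = {
    label: round(remaining * pct / 100)
    for label, pct in composition.items()
}

print(f"removed {removed:,} names ({removed_pct:.1f}% of the selected set)")
for label, count in estimated.items():
    print(f"{label}: roughly {count:,}")
```

The removed share comes out slightly under 8%, which is consistent with the rounded figure in the abstract.
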
Year | DOI | Venue |
---|---|---|
2004 | 10.1142/S0219720004000399 | J. Bioinformatics and Computational Biology |
Keywords | Field | DocType |
information extraction, pattern analysis, bioinformatics | Inductive logic programming, Morphological pattern, Gene, Computer science, Natural language, Information extraction, Lexicon, Natural language processing, Artificial intelligence, Valid name, Bioinformatics, Named-entity recognition | Journal |
Volume | Issue | ISSN |
1 | 4 | 0219-7200 |
Citations | PageRank | References |
16 | 0.92 | 14 |
Authors (2) | ||
Name | Order | Citations | PageRank |
---|---|---|---|
Lorraine Tanabe | 1 | 383 | 29.80 |
W John Wilbur | 2 | 214 | 16.53 |