Title
Generation of a large gene/protein lexicon by morphological pattern analysis.
Abstract
The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we processed MEDLINE documents to obtain a collection of over two million names, of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high-quality subset of gene/protein names. Here we describe an approach based on generating certain classes of names that are characterized by common morphological features. Within each class, inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names, and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was then applied to remove 8% of this set, leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.
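The abstract does not say how names are assigned to morphological classes, so the following is only a rough illustration of the general idea of grouping names by shared surface form. The `shape` signature scheme, the function names, and the sample names are all assumptions made for this sketch; they are not the 193 classes or the ILP criteria from the paper.

```python
from collections import defaultdict


def shape(name: str) -> str:
    """Return a coarse morphological signature for a name.

    Hypothetical scheme: uppercase -> 'A', lowercase -> 'a', digit -> '9',
    anything else is kept as-is; runs of the same symbol are collapsed.
    """
    out = []
    for ch in name:
        if ch.isupper():
            sym = "A"
        elif ch.islower():
            sym = "a"
        elif ch.isdigit():
            sym = "9"
        else:
            sym = ch
        # Collapse consecutive identical symbols so "BRCA" becomes "A".
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def group_by_shape(names):
    """Group candidate names into classes that share a signature."""
    classes = defaultdict(list)
    for n in names:
        classes[shape(n)].append(n)
    return dict(classes)


# Illustrative candidates only; not taken from the lexicon.
names = ["p53", "p21", "BRCA1", "CDK2", "IL-2", "TNF-alpha", "the cell"]
classes = group_by_shape(names)
for sig, members in classes.items():
    print(sig, members)
```

In a pipeline like the one described, a learner would then be trained within each such class to separate valid gene/protein names from the rest; that step is not shown here.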
Year: 2004
DOI: 10.1142/S0219720004000399
Venue: J. Bioinformatics and Computational Biology
Keywords: information extraction, pattern analysis, bioinformatics
Field: Inductive logic programming, Morphological pattern, Gene, Computer science, Natural language, Information extraction, Lexicon, Natural language processing, Artificial intelligence, Valid name, Bioinformatics, Named-entity recognition
DocType: Journal
Volume: 1
Issue: 4
ISSN: 0219-7200
Citations: 16
PageRank: 0.92
References: 14
Authors: 2
Name: Lorraine Tanabe; Order: 1; Citations: 383; PageRank: 29.80
Name: W John Wilbur; Order: 2; Citations: 214; PageRank: 16.53