Generation of a large gene/protein lexicon by morphological pattern analysis. - Citegraph

Paper Info

Title
Generation of a large gene/protein lexicon by morphological pattern analysis.

Abstract
The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (+/-3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.

Year	2004
DOI	10.1142/S0219720004000399
Venue	J. Bioinformatics and Computational Biology
Keywords	information extraction, pattern analysis, bioinformatics
Field	Inductive logic programming, Morphological pattern, Gene, Computer science, Natural language, Information extraction, Lexicon, Natural language processing, Artificial intelligence, Valid name, Bioinformatics, Named-entity recognition
DocType	Journal
Volume	1
Issue	4
ISSN	0219-7200
Citations	16
PageRank	0.92
References	14
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Lorraine Tanabe	1	383	29.80
W John Wilbur	2	214	16.53
