Title
AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names
Abstract
Researchers, hindered by a lack of standard gene and protein naming conventions, endure long, sometimes fruitless, literature searches. A system is described which is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure 0.7, nearly 60% of which were 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.
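The abstract grades each per-LLID model by its F-measure and sorts the models into good and poor quality. The following is a minimal sketch of that bookkeeping, not the authors' implementation. It assumes the balanced F-measure (harmonic mean of precision and recall) and an inclusive cutoff at 0.7; the function names, the gene labels, and the confusion counts are hypothetical and chosen only for illustration.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives.

    Assumption: the balanced (F1) form is used; the abstract only says "F-measure".
    """
    if tp == 0:
        return 0.0  # no correct assignments: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_by_quality(scores, threshold=0.7):
    """Partition per-gene scores into (good, poor) models.

    Assumption: a model counts as good when its score is at or above the threshold.
    """
    good = {gene: s for gene, s in scores.items() if s >= threshold}
    poor = {gene: s for gene, s in scores.items() if s < threshold}
    return good, poor


# Hypothetical per-gene counts (tp, fp, fn); not data from the paper.
counts = {"LLID_A": (45, 5, 5), "LLID_B": (3, 4, 9)}
scores = {gene: f_measure(*c) for gene, c in counts.items()}
good, poor = split_by_quality(scores)
```

In this toy input, a gene with few correctly assigned documents falls below the cutoff, which mirrors the abstract's remark that most poor models stem from too few known document references.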
Year
2004
DOI
10.1109/CSB.2004.45
Venue
CSB
Keywords
supervised learning, proteins, learning (artificial intelligence), molecular biophysics, genetics
Field
Computer science, Supervised learning, Artificial intelligence, Scalable system, Bioinformatics, MEDLINE, Gene nomenclature, Machine learning
DocType
Conference
ISBN
0-7695-2194-0
Citations
18
PageRank
0.98
References
10
Authors
5
Name                    Order  Citations  PageRank
Raf M. Podowski         1      30         2.14
John G. Cleary          2      1791       365.78
Nicholas T. Goncharoff  3      24         1.44
Gregory Amoutzias       4      24         1.44
William S. Hayes        5      31         2.50