Abstract | ||
---|---|---|
Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described which is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques. |
Year | DOI | Venue |
---|---|---|
2004 | 10.1109/CSB.2004.45 | CSB |
Keywords | Field | DocType |
supervised learning,proteins,learning artificial intelligence,molecular biophysics,genetics | Computer science,Supervised learning,Artificial intelligence,Scalable system,Bioinformatics,MEDLINE,Gene nomenclature,Machine learning | Conference |
ISBN | Citations | PageRank |
0-7695-2194-0 | 18 | 0.98 |
References | Authors | |
10 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Raf M. Podowski | 1 | 30 | 2.14 |
John G. Cleary | 2 | 1791 | 365.78 |
Nicholas T. Goncharoff | 3 | 24 | 1.44 |
Gregory Amoutzias | 4 | 24 | 1.44 |
William S. Hayes | 5 | 31 | 2.50 |