NLProt: extracting protein names and sequences from papers.

Paper Info

Title
NLProt: extracting protein names and sequences from papers.

Abstract
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.

Year	2004
DOI	10.1093/nar/gkh427
Venue	Nucleic Acids Research
Keywords	nucleic
Field	Information retrieval, Biology, Internet servers, Identifier, Support vector machine, Biological database, Genetics, MEDLINE, The Internet
DocType	Journal
Volume	32
Issue	SUPnan
ISSN	0305-1048
Citations	15
PageRank	0.76
References	14
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Sven Mika	1	106	8.59
Burkhard Rost	2	795	88.14
