Title
Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers.
Abstract
Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed.
Results: This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for the unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html.
Conclusions: Comparable approaches usually focus on variations described at the protein sequence level and neglect the problems posed by other SNP mentions. The results presented here indicate that normalizing SNPs described at the DNA level is more difficult than normalizing SNPs described at the protein level. The challenges associated with normalization are exemplified by ambiguities and errors that occur in this corpus.
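The precision and recall figures quoted in the abstract can be read as follows: precision is the share of proposed mention-to-identifier links that are correct, and recall is the share of annotated mentions that received a correct link. A minimal sketch of that calculation is given here. The function name, the mention keys, and the identifiers are hypothetical placeholders chosen for illustration; they are not taken from the corpus and do not reproduce the authors' evaluation code.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for mention normalization.

    predicted: dict mapping a mention key to the identifier the system proposed
               (mentions the system left unresolved are simply absent).
    gold:      dict mapping every annotated mention key to its correct identifier.
    """
    # A prediction counts as correct only if it matches the gold identifier.
    true_positives = sum(
        1 for mention, identifier in predicted.items()
        if gold.get(mention) == identifier
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical placeholders, not real corpus annotations).
gold = {"m1": "rsA", "m2": "rsB", "m3": "rsC", "m4": "rsD"}
predicted = {"m1": "rsA", "m2": "rsB", "m3": "rsX"}  # m4 left unresolved

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the three proposed links are correct and two of the four annotated mentions are recovered. A system that declines to link ambiguous mentions, as in this sketch, tends to trade recall for precision.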
Year
2011
DOI
10.1186/1471-2105-12-S4-S4
Venue
BMC Bioinformatics
Keywords
information extraction, natural language, genetic variation, microarrays, genomics, bioinformatics, protein sequence, single nucleotide polymorphism, mutation, algorithms
Field
Regular expression, Identifier, Biology, dbSNP, Genomics, Natural language, Information extraction, Single-nucleotide polymorphism, Bioinformatics, Genetics, Database, Reference genome
DocType
Journal
Volume
12
Issue
S-4
ISSN
1471-2105
Citations
19
PageRank
0.83
References
32
Authors
5
Name                     Order  Citations  PageRank
Philippe Thomas          1      238        12.94
Roman Klinger            2      201        29.85
Laura Inés Furlong       3      43         1.91
Martin Hofmann-Apitius   4      372        30.08
Christoph M. Friedrich   5      186        25.44