Title
Knowledge-driven geospatial location resolution for phylogeographic models of virus migration
Abstract
Summary: Diseases caused by zoonotic viruses (viruses transmissible between animals and humans) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component of phylogeographic analysis of zoonotic viruses is identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles.
Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation, including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a 'metadata heuristic').
Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers who rely on geospatial metadata of GenBank sequences.
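The three disambiguation heuristics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe how each heuristic is computed, so the `Candidate` type, the function names, the country-code matching rule and the sample gazetteer entries (with approximate, illustrative coordinates and populations) are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Candidate:
    """One hypothetical gazetteer entry for an ambiguous place name."""
    name: str
    country: str      # ISO-style country code (assumed field)
    lat: float
    lon: float
    population: int


def haversine_km(a: Candidate, b: Candidate) -> float:
    """Great-circle distance between two candidates in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def by_population(candidates):
    """Population heuristic (assumed form): prefer the most populous candidate."""
    return max(candidates, key=lambda c: c.population)


def by_distance(candidates, anchors):
    """Distance heuristic (assumed form): prefer the candidate closest, in total,
    to locations already resolved in the same article."""
    return min(candidates, key=lambda c: sum(haversine_km(c, a) for a in anchors))


def by_metadata(candidates, genbank_country):
    """Metadata heuristic (assumed form): keep candidates whose country matches
    the country recorded in the sequence metadata, then fall back to population."""
    matches = [c for c in candidates if c.country == genbank_country]
    return by_population(matches or candidates)


# Illustrative, approximate values only; not taken from the paper.
paris_fr = Candidate("Paris", "FR", 48.86, 2.35, 2_100_000)
paris_us = Candidate("Paris", "US", 33.66, -95.56, 25_000)
dallas = Candidate("Dallas", "US", 32.78, -96.80, 1_300_000)
candidates = [paris_fr, paris_us]

print(by_population(candidates).country)
print(by_distance(candidates, [dallas]).country)
print(by_metadata(candidates, "US").country)
```

In this toy case the population heuristic alone would select the French city, while the distance and metadata heuristics both use context (a nearby resolved location, or a country from the sequence record) to select the smaller place in the United States.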
Year: 2015
DOI: 10.1093/bioinformatics/btv259
Venue: BIOINFORMATICS
Field: Phylogeography, Geospatial analysis, Metadata, Population, Data mining, Heuristic, Geospatial metadata, Computer science, Heuristics, Bioinformatics, GenBank
DocType: Journal
Volume: 31
Issue: 12
ISSN: 1367-4803
Citations: 7
PageRank: 0.57
References: 13
Authors: 7
Name | Order | Citations | PageRank
Davy Weissenbacher | 1 | 12 | 1.75
Tasnia Tahsin | 2 | 30 | 3.28
Rachel Beard | 3 | 12 | 1.77
Mari Figaro | 4 | 7 | 0.57
Robert Rivera | 5 | 12 | 1.77
Matthew Scotch | 6 | 123 | 11.56
Graciela Gonzalez | 7 | 120 | 10.03