Title
An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data (Application Paper).
Abstract
We present an automatic approach for discovering location names in web data culled from diverse domains. Our approach builds upon the Apache Tika, Apache OpenNLP, and Apache Lucene frameworks. Tika is used to extract text and metadata from any file. The text and metadata are passed to Apache OpenNLP and its location entity extraction model. The discovered location entities are then looked up in a gazetteer derived from the Geonames.org dataset and indexed in Apache Lucene. This paper describes the overall approach and then explains in detail the challenges we faced and the methodology we employed to overcome them. We describe the evolution of our geo gazetteer process and algorithm, and demonstrate the approach's accuracy on data collected in the DARPA MEMEX and NSF Polar Cyber Infrastructure efforts.
Year
2016
Venue
IRI
Field
Data mining, Metadata, World Wide Web, Geocoding, Memex, Computer science, Cyber infrastructure, Artificial intelligence, Machine learning
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
2
Name, Order, Citations, PageRank
Chris A. Mattmann, 1, 200, 25.39
Madhav Sharan, 2, 0, 0.68