Title
Knowledge capture from multiple online sources with the extensible web retrieval toolkit (eWRT)
Abstract
Knowledge capture approaches in the age of massive Web data require robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data, both unstructured and structured. This paper addresses this requirement by introducing the Extensible Web Retrieval Toolkit (eWRT), a modular Python API for retrieving social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia. eWRT has been released as an open source library under GNU GPLv3. It includes classes for caching and data management, and provides low-level text processing capabilities including language detection, phonetic string similarity measures, and string normalization.
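To illustrate the kind of low-level text processing the abstract mentions, the following is a minimal sketch of string normalization and a phonetic similarity check based on American Soundex. It is not taken from eWRT: the function names `normalize`, `soundex` and `sounds_alike` are hypothetical, and Soundex is only an assumed stand-in for whatever phonetic measures the toolkit actually implements.

```python
import unicodedata

# Soundex digit for each consonant group (vowels, H, W and Y have no digit).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def normalize(text):
    """Lowercase the text, strip accents and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed
                              if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def soundex(word):
    """Return the four-character American Soundex code of a word."""
    letters = [c for c in normalize(word).upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit is kept
    return (code + "000")[:4]


def sounds_alike(a, b):
    """Hypothetical helper: True if both words share a Soundex code."""
    return soundex(a) == soundex(b)
```

For example, `sounds_alike("Robert", "Rupert")` holds because both names map to the same code, while `normalize` can be applied on its own to make strings from different Web sources comparable before matching.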
Year
2013
DOI
10.1145/2479832.2479861
Venue
K-CAP
Keywords
string normalization, knowledge capture approach, social data, heterogeneous data, extensible web retrieval toolkit, web source, multiple online source, gnu gplv3, massive web data, phonetic string similarity measure, data management, social media, text mining, knowledge extraction, data acquisition
Field
World Wide Web, Information retrieval, Computer science, Language identification, Knowledge extraction, Modular design, String metric, Data management, Python (programming language), Text processing, Scalability
DocType
Conference
Citations
0
PageRank
0.34
References
13
Authors
3
Name, Order, Citations, PageRank
Albert Weichselbraun, 1, 291, 28.39
Arno Scharl, 2, 696, 67.13
Heinz-Peter Lang, 3, 12, 1.54