Title
Quick-and-clean extraction of linked data entities from microblogs
Abstract
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach is able to significantly speed up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content. We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare runtime and number of entities extracted based on the full and the downscaled version of a micropost set. We are able to demonstrate that for datasets of more than 10 million tweets, we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage on unique entities cumulatively discovered by the three IE tools. We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
Year
2014
DOI
10.1145/2660517.2660527
Venue
SEMANTICS
Keywords
design, experimentation, measurement, text analysis, information search and retrieval, performance, optimization, big data
DocType
Conference
Citations
3
PageRank
0.37
References
15
Authors
5
Name                    Order   Citations   PageRank
Oluwaseyi Feyisetan     1       6           1.41
Elena Simperl           2       1069        122.60
Ramine Tinati           3       142         18.86
Markus Luczak-Rösch     4       167         22.68
Nigel Shadbolt          5       4273        321.53