Quick-and-clean extraction of linked data entities from microblogs - Citegraph

Paper Info

Title
Quick-and-clean extraction of linked data entities from microblogs

Abstract
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach is able to significantly speed up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content. We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation we compare runtime and number of entities extracted based on the full and the downscaled version of a micropost set. We are able to demonstrate that for datasets of more than 10 million tweets we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage on unique entities cumulatively discovered by the three IE tools. We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
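The downscaling idea in the abstract (drop retweets, drop tweets without pre-identifiable entities, drop similar and redundant tweets) can be illustrated with a minimal sketch. This is not the authors' implementation: the "RT @" retweet test, the capitalization/hashtag/mention heuristic in `candidate_tokens`, the Jaccard similarity measure, and the `threshold` value are all assumptions chosen for illustration.

```python
import re


def is_retweet(text):
    # Assumption: retweets are marked by a leading "RT @user" prefix.
    return text.lstrip().lower().startswith("rt @")


def candidate_tokens(text):
    # Hypothetical heuristic for "pre-identifiable entities": hashtags,
    # mentions, and capitalized words that are not the first word.
    found = set()
    for i, word in enumerate(text.split()):
        core = word.strip(".,!?:;\"'()")
        if not core:
            continue
        if core[0] in "#@" and len(core) > 1:
            found.add(core.lower())
        elif i > 0 and core[0].isupper():
            found.add(core.lower())
    return found


def jaccard(a, b):
    # Jaccard similarity of two token sets (1.0 when both are empty).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def downscale(tweets, threshold=0.8):
    # Keep a tweet only if it is not a retweet, contains at least one
    # candidate entity token, and is not too similar to a kept tweet.
    kept, kept_tokens = [], []
    for text in tweets:
        if is_retweet(text) or not candidate_tokens(text):
            continue
        tokens = set(re.findall(r"[#@]?\w+", text.lower()))
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(text)
        kept_tokens.append(tokens)
    return kept


sample = [
    "RT @bob: Obama visits Berlin today",
    "Obama visits Berlin today",
    "Obama visits Berlin today!",
    "good morning everyone",
    "Watching the game with @alice",
]
reduced = downscale(sample)
```

The pairwise comparison here is quadratic in the number of kept tweets, so a corpus of the size mentioned in the abstract would need an approximate scheme instead; the sketch only shows the order of the three filters.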

Year	DOI	Venue
2014	10.1145/2660517.2660527	SEMANTICS
Keywords	DocType	Citations
design,experimentation,measurement,text analysis,information search and retrieval,performance,optimization,big data	Conference	3
PageRank	References	Authors
0.37	15	5

Authors (5 rows)

Name	Order	Citations	PageRank
Oluwaseyi Feyisetan	1	6	1.41
Elena Simperl	2	1069	122.60
Ramine Tinati	3	142	18.86
Markus Luczak-Rösch	4	167	22.68
Nigel Shadbolt	5	4273	321.53
