Abstract | ||
---|---|---|
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content. We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted from the full and the downscaled versions of a micropost set. We demonstrate that, for datasets of more than 10 million tweets, we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools. We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology. |
Year | DOI | Venue |
---|---|---|
2014 | 10.1145/2660517.2660527 | SEMANTICS |
Keywords | DocType | Citations |
---|---|---|
design, experimentation, measurement, text analysis, information search and retrieval, performance, optimization, big data | Conference | 3 |
PageRank | References | Authors |
---|---|---|
0.37 | 15 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Oluwaseyi Feyisetan | 1 | 6 | 1.41 |
Elena Simperl | 2 | 1069 | 122.60 |
Ramine Tinati | 3 | 142 | 18.86 |
Markus Luczak-Rösch | 4 | 167 | 22.68 |
Nigel Shadbolt | 5 | 4273 | 321.53 |