On building a reusable Twitter corpus - Citegraph

Paper Info

Title
On building a reusable Twitter corpus

Abstract
The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In this paper, we detail a new methodology for legally building and distributing Twitter corpora, developed through collaboration between the Text REtrieval Conference (TREC) and Twitter. In particular, we detail how the first publicly available Twitter corpus - referred to as Tweets2011 - was distributed via lists of tweet identifiers and specialist tweet crawling software. Furthermore, we analyse whether this distribution approach remains robust over time, as tweets in the corpus are removed either by users or Twitter itself. Tweets2011 was successfully used by 58 participating groups for the TREC 2011 Microblog track, while our results attest to the robustness of the crawling methodology over time.

Year	DOI	Venue
2012	10.1145/2348283.2348495	SIGIR
Keywords	Field	DocType
new methodology,information retrieval task,twitter term,twitter corpus,real-time search,available twitter corpus,reusable twitter corpus,twitter data,specialist tweet,crawling methodology,twitter real-time information network,reproducibility,information retrieval,real time	Data mining,World Wide Web,Crawling,Social media,Information retrieval,Identifier,Computer science,Microblogging,Robustness (computer science),Software,Terms of service,Text Retrieval Conference	Conference
Citations	PageRank	References
39	1.80	1
Authors
6

Authors (6 rows)

Name	Order	Citations	PageRank
Richard McCreadie	1	403	32.43
Ian Soboroff	2	1907	218.39
Jimmy Lin	3	4800	376.93
Craig Macdonald	4	2588	178.50
Iadh Ounis	5	3438	234.59
Dean McCullough	6	59	2.94
