The Twitter real-time information network is the subject
of research for information retrieval tasks such as real-time
search. However, so far, reproducible experimentation on
Twitter data has been impeded by restrictions imposed by
the Twitter terms of service. In this paper, we detail a new
methodology for legally building and distributing Twitter
corpora, developed through collaboration between the Text
REtrieval Conference (TREC) and Twitter. In particular,
we detail how the first publicly available Twitter corpus –
referred to as Tweets2011 – was distributed via lists of tweet
identifiers and specialist tweet crawling software. Furthermore,
we analyse whether this distribution approach remains
robust over time, as tweets in the corpus are removed either
by users or Twitter itself. Tweets2011 was successfully used
by 58 participating groups for the TREC 2011 Microblog
track, while our results attest to the robustness of the crawling
methodology over time.
Richard McCreadie, Ian Soboroff, Jimmy Lin, Craig Macdonald, Iadh Ounis, and Dean McCullough.
On Building a Reusable Twitter Corpus.
In Proceedings of SIGIR 2012.
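
To make the distribution idea in the abstract concrete, here is a minimal sketch of rebuilding a corpus from a list of tweet identifiers and then measuring how much of it is still retrievable. It is not the crawling software released with Tweets2011. The function names, the `fetch` callback, and the in-memory `live_tweets` store are all hypothetical stand-ins, assumed purely for illustration.

```python
from typing import Callable, Dict, Iterable, Optional


def rebuild_corpus(
    tweet_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Rebuild a corpus from an identifier list.

    `fetch` is a hypothetical callback that returns the tweet text,
    or None when the tweet is no longer available (for example,
    because it was removed by its author or by the service).
    """
    corpus: Dict[str, str] = {}
    for tweet_id in tweet_ids:
        text = fetch(tweet_id)
        if text is not None:
            corpus[tweet_id] = text
    return corpus


def retained_fraction(tweet_ids: Iterable[str], corpus: Dict[str, str]) -> float:
    """Fraction of the listed identifiers that were actually retrieved."""
    ids = list(tweet_ids)
    if not ids:
        return 0.0
    return sum(1 for tweet_id in ids if tweet_id in corpus) / len(ids)


# Toy stand-in for the live service: one of the four listed tweets is gone.
live_tweets = {
    "1001": "first tweet",
    "1002": "second tweet",
    "1004": "fourth tweet",
}
id_list = ["1001", "1002", "1003", "1004"]

corpus = rebuild_corpus(id_list, live_tweets.get)
print(retained_fraction(id_list, corpus))
```

Passing the fetcher in as a callback keeps the sketch independent of any particular API or crawler. Only the identifier list is shared, and each group retrieves the tweets themselves.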