Richard McCreadie

Tuesday, 3 September 2013

Scalable Distributed Event Detection for Twitter

Posted by Dr. Richard McCreadie at 03:25 1 comments

Social media streams, such as Twitter, have shown themselves to be useful sources of real-time information about what is happening in the world. Automatic detection and tracking of events identified in these streams have a variety of real-world applications, e.g. identifying and automatically reporting road accidents for emergency services. However, to be useful, events need to be identified within the stream with very low latency. This is challenging due to the high volume of posts within these social streams. In this paper, we propose a novel event detection approach that can both effectively detect events within social streams like Twitter and scale to thousands of posts every second. Through experimentation on a large Twitter dataset, we show that our approach can process the equivalent of the full Twitter Firehose stream, while maintaining event detection accuracy and outperforming an alternative distributed event detection system.

Richard McCreadie, Craig Macdonald, Iadh Ounis, Miles Osborne, Sasa Petrovic
Scalable Distributed Event Detection for Twitter
In Proceedings of the IEEE Big Data Conference 2013

PDF
BIBTEX

Wednesday, 17 July 2013

News Vertical Search: When and What to Display to Users

Posted by Dr. Richard McCreadie at 03:33 1 comments

News reporting has seen a shift toward fast-paced online reporting in new sources such as social media. Web search engines that support a news vertical have historically relied upon articles published by major newswire providers when serving news-related queries. In this paper, we investigate to what extent real-time content from newswire, blogs, Twitter and Wikipedia sources is useful to return to the user in the current fast-paced news search setting. In particular, we perform a detailed user study using the emerging medium of crowdsourcing to determine when and where integrating news-related content from these various sources can better serve the user’s news need. We sampled approximately 300 news-related search queries using Google Trends and Bitly data in real time for two time periods. For each of these queries, we had crowdsourced workers compare the Web search ranking with similar rankings that integrate real-time news content from sources such as Twitter or the blogosphere. Our results show that users exhibited a preference for rankings integrating newswire articles for only half of our queries, indicating that relying solely on newswire providers for news-related content is now insufficient. Moreover, our results show that users preferred rankings that integrate tweets more often than those that integrate newswire articles, showing the potential of using social media to better serve news queries.

Richard McCreadie, Craig Macdonald, Iadh Ounis
News Vertical Search: When and What to Display to Users
In Proceedings of SIGIR 2013

PDF
BIBTEX

Monday, 15 July 2013

Learning to Process Big Data with MapReduce and Hadoop - Hands-On Exercises

Posted by Dr. Richard McCreadie at 05:28 160 comments

I recently ran a hands-on session on MapReduce and Hadoop at the SICSA Summer School on Big Data Information Visualisation. The aim was to teach students how to process Big Data in a fast, scalable manner using the MapReduce programming paradigm. The students started with a simple word counting program, learning its structure and how to deploy it both locally and to an Amazon Web Services Elastic MapReduce cluster. They then learned how to modify MapReduce programs to find the top hashtags in Twitter data and to design output formats that target another application, in this case Twitter search using the open source Terrier Information Retrieval platform.
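To give a flavour of the top-hashtags task, here is a minimal sketch of the map, shuffle and reduce steps. It is written in plain Python with no Hadoop dependency, so it is not the Java code from the lab bundle. The function names (`map_tweet`, `shuffle`, `reduce_counts`, `top_hashtags`) and the sample tweets are made up for illustration, and the in-memory `shuffle` only stands in for the grouping that a real MapReduce framework performs between the two phases.

```python
import re
from collections import defaultdict
from itertools import chain

# Hypothetical pattern for this sketch: a '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def map_tweet(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    return [(tag.lower(), 1) for tag in HASHTAG.findall(tweet)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one hashtag."""
    return key, sum(values)


def top_hashtags(tweets, k):
    """Run map, shuffle and reduce in memory and return the k most frequent hashtags."""
    pairs = chain.from_iterable(map_tweet(t) for t in tweets)
    counts = [reduce_counts(key, values) for key, values in shuffle(pairs).items()]
    # Sort by descending count, then alphabetically so that ties are deterministic.
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up sample input, not taken from the lab data.
tweets = [
    "Learning #Hadoop at the summer school #bigdata",
    "#hadoop word count works locally",
    "Now running on a cluster #BigData #hadoop",
]
print(top_hashtags(tweets, 2))
```

Word counting has the same shape: only the map step changes, emitting a pair per word rather than per hashtag. In a real Hadoop job the final top-k selection would typically need its own step, since each reducer only sees a subset of the keys.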

The session was highly successful, so I am making the exercise sheets and source code freely available. The exercise sheets assume that you have some familiarity with Java programming, that you are running either Linux or Mac and, if you want to use Elastic MapReduce, that you have already created an account (if not, I suggest you run through the video tutorials available here).

Download the Lab Code-Base Now

This is a code bundle containing Hadoop version 1.1.2, s3cmd, elastic-mapreduce and an Eclipse workspace containing the incomplete Java classes that will be modified during the exercises.

Download the Exercise Sheets Now

This is a PDF containing the three exercise sheets.

Tuesday, 2 April 2013

Relevance in Microblogs: Enhancing Tweet Retrieval using Hyperlinked Documents

Posted by Dr. Richard McCreadie at 04:11 0 comments

Twitter serves over 1.6 billion searches each day, ranking tweets for display to the user in reverse-chronological order. However, finding relevant tweets can be a challenging task, since the relevance of a tweet is dependent both on its content and whether it links to a useful document. In this paper, we investigate how the content of documents hyperlinked from a tweet can be used to better estimate that tweet’s relevance. In particular, we propose three approaches for incorporating the content of hyperlinked documents when ranking tweets. Within the context of the TREC 2011 and 2012 Microblog Tracks, we thoroughly evaluate to what extent hyperlinked documents can aid tweet retrieval effectiveness. Our results show that the application of hyperlinked documents can improve retrieval effectiveness over using the tweet content alone as well as using the presence of a URL within the tweet as a feature.

Richard McCreadie, Craig Macdonald
Relevance in Microblogs: Enhancing Tweet Retrieval using Hyperlinked Documents
In Proceedings of OAIR 2013

PDF
BIBTEX

Wednesday, 27 March 2013

University of Glasgow at TREC 2012: Experiments with Terrier in Medical Records, Microblog, and Web Tracks

Posted by Dr. Richard McCreadie at 06:26 0 comments

In TREC 2012, we focus on tackling the new challenges posed by the Medical, Microblog and Web tracks, using our Terrier Information Retrieval Platform. In particular, for the Medical track, we investigate how to exploit implicit knowledge within medical records, with the aim of better identifying those records from patients with specific medical conditions. For the Microblog track adhoc task, we investigate novel techniques to leverage documents hyperlinked from tweets to better estimate relevance of those tweets and increase recall. Meanwhile, for the Microblog track filtering task, we developed a new stream processing infrastructure for real-time adaptive filtering on top of the Storm framework. For the TREC Web track, we continue to build upon our learning-to-rank approaches and novel xQuAD framework within Terrier, increasing both effectiveness and efficiency when ranking.

Nut Limsopatham, Richard McCreadie, M-Dyaa Albakour, Craig Macdonald, Rodrygo L. T. Santos, and Iadh Ounis
University of Glasgow at TREC 2012: Experiments with Terrier in Medical Records, Microblog, and Web Tracks
Proceedings of TREC 2012

PDF
BIBTEX

Thursday, 7 February 2013

Thesis: News Vertical Search using User-Generated Content

Posted by Dr. Richard McCreadie at 10:35 0 comments

The thesis investigates how content produced by end-users on the World Wide Web, referred to as user-generated content, can enhance the news vertical aspect of a universal Web search engine, such that news-related queries can be satisfied more accurately, comprehensively and in a more timely manner. We propose a news search framework to describe the news vertical aspect of a universal Web search engine. This framework comprises four components, each providing a different piece of functionality. The Top Events Identification component identifies the most important events that are happening at any given moment using discussion in user-generated content streams. The News Query Classification component classifies incoming queries as news-related or not in real time. The Ranking News-Related Content component finds and ranks relevant content for news-related user queries from multiple streams of news and user-generated content. Finally, the News-Related Content Integration component merges the previously ranked content for the user query into the Web search ranking. In this thesis, we argue that user-generated content can be leveraged in one or more of these components to better satisfy news-related user queries. Potential enhancements include the faster identification of news queries relating to breaking news events, more accurate classification of news-related queries, increased coverage of the events searched for by the user or increased freshness in the results returned.

Approaches that aim to leverage user-generated content are proposed to tackle each of the four components of the news search framework. Together, these approaches form the news vertical component of a universal Web search engine. Each approach proposed for a component is thoroughly evaluated using one or more datasets developed for that component. Conclusions are derived concerning whether the use of user-generated content enhances the component in question using an appropriate measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the Top Events Identification component; classification accuracy over different types of query for the News Query Classification component; relevance of the documents returned for the Ranking News-Related Content component; and end-user preference for rankings integrating user-generated content in comparison to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches and insights into their behaviour are also discussed.

In particular, the evaluation of the Top Events Identification component examines how effectively events, represented by newswire articles, can be ranked by their importance using two different streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed approaches for this component indicates that blog posts are an effective source of evidence to use when ranking events and that these approaches achieve state-of-the-art effectiveness. When driven instead by a stream of tweets, the same approaches provide a story ranking performance that is significantly more effective than random, but this is not consistent across all of the datasets and approaches tested. Insights are provided into the reasons for this with regard to the transient nature of discussion in Twitter.

Through the evaluation of the News Query Classification component, we show that the use of timely features extracted from different news and user-generated content sources can increase the accuracy of news query classification over relying upon newswire provider streams alone. Evidence also suggests that the usefulness of the user-generated content sources varies as news events mature, with some sources becoming more influential over time as new content is published, leading to an upward trend in classification accuracy.

The Ranking News-Related Content component evaluation investigates how to effectively rank content from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show that learning to rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking effectiveness under real-time constraints.

Finally, this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated content for news-related queries to rankings containing only Web search results or integrated with only newswire articles. Of the user-generated content sources tested, the most popular source is shown to be Twitter, particularly for queries relating to breaking events.

The central contributions of this thesis are the introduction of a news search framework, the approaches to tackle each of the four components of the framework that integrate user-generated content and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad range of experiments spanning the entire search process for news-related queries. The experiments reported in this thesis demonstrate the potential and scope for enhancements that can be brought about by leveraging user-generated content for real-time news search and related applications.

News Vertical Search using User-Generated Content
Richard McCreadie
University of Glasgow, 2012

PDF
BIBTEX

Tuesday, 15 January 2013

Bieber no more: First Story Detection using Twitter and Wikipedia

Posted by Dr. Richard McCreadie at 09:59 0 comments

Twitter is a well known source of information regarding breaking news stories. This aspect of Twitter makes it ideal for identifying events as they happen. However, a key problem with Twitter-driven event detection approaches is that they produce many spurious events, i.e., events that are wrongly detected or simply are of no interest to anyone. In this paper, we examine whether Wikipedia (when viewed as a stream of page views) can be used to improve the quality of discovered events in Twitter. Our results suggest that Wikipedia is a powerful filtering mechanism, allowing for easy blocking of large numbers of spurious events. Our results also indicate that events within Wikipedia tend to lag behind Twitter.

Miles Osborne, Saša Petrovic, Richard McCreadie, Craig Macdonald, and Iadh Ounis
Bieber no more: First Story Detection using Twitter and Wikipedia
In Proceedings of TAIA 2012 at SIGIR 2012

PDF
BIBTEX