Wednesday, 17 July 2013

News Vertical Search: When and What to Display to Users


News reporting has seen a shift toward fast-paced online reporting in new sources such as social media. Web search engines that support a news vertical have historically relied upon articles published by major newswire providers when serving news-related queries. In this paper, we investigate to what extent real-time content from newswire, blogs, Twitter and Wikipedia sources is useful to return to the user in the current fast-paced news search setting. In particular, we perform a detailed user study using the emerging medium of crowdsourcing to determine when and where integrating news-related content from these various sources can better serve the user’s news need. We sampled approximately 300 news-related search queries using Google Trends and Bitly data in real time for two time periods. For these queries, we have crowdsourced workers compare Web search rankings for each, with similar rankings integrating real-time news content from sources such as Twitter or the blogosphere. Our results show that users exhibited a preference for rankings integrating newswire articles for only half of our queries, indicating that relying solely on newswire providers for news-related content is now insufficient. Moreover, our results show that users preferred rankings that integrate tweets more often than those that integrate newswire articles, showing the potential of using social media to better serve news queries.

Richard McCreadie, Craig Macdonald, Iadh Ounis
News Vertical Search: When and What to Display to Users
In Proceedings of SIGIR 2013


Monday, 15 July 2013

Learning to Process Big Data with MapReduce and Hadoop - Hands-On Exercises

I recently ran a hands-on session on MapReduce and Hadoop at the SICSA Summer School on Big Data Information Visualisation. The aim was to teach students how to process Big Data in a fast, scalable manner using the MapReduce programming paradigm. The students started with a simple word counting program, learning its structure and how to deploy it both locally and to an Amazon Web Services Elastic MapReduce cluster. They then learned how to modify MapReduce programs to find the top hashtags in Twitter data, and how to design output formats that target another application, in this case Twitter search using the open-source Terrier Information Retrieval platform.
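To give a flavour of the hashtag exercise, here is a minimal sketch of the map, shuffle and reduce steps in plain Java. It is not the code from the lab code-base and it does not use the Hadoop API: the class `HashtagCountSketch` and its methods are hypothetical names chosen for illustration, and everything runs in memory on a handful of strings. Tokenisation is deliberately naive (split on whitespace, lower-case, keep tokens starting with `#`).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of a MapReduce job, no Hadoop needed.
public class HashtagCountSketch {

    // Map phase: emit a (hashtag, 1) pair for every hashtag in one tweet.
    public static List<Map.Entry<String, Integer>> map(String tweet) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : tweet.split("\\s+")) {
            if (token.length() > 1 && token.startsWith("#")) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all the counts that were emitted for one hashtag.
    public static int reduce(String hashtag, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every tweet, group the pairs by key (the "shuffle"), then reduce.
    public static Map<String, Integer> run(List<String> tweets) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String tweet : tweets) {
            for (Map.Entry<String, Integer> pair : map(tweet)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    // Post-processing: the n most frequent hashtags, ties broken alphabetically.
    public static List<String> top(Map<String, Integer> totals, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "Learning #hadoop and #mapreduce today",
            "#Hadoop on EMR",
            "no tags here");
        Map<String, Integer> totals = run(tweets);
        System.out.println(totals);
        System.out.println(top(totals, 1));
    }
}
```

In a real Hadoop job the grouping step is done by the framework between the mapper and the reducer, so only the bodies of `map` and `reduce` would carry over, adapted to Hadoop's own types.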

The session was highly successful, so I am making the exercise sheets and source code freely available. The exercise sheets assume that you have some familiarity with Java programming and are running either Linux or Mac. If you want to use Elastic MapReduce, they also assume that you have already created an account (if not, I suggest you run through the video tutorials available here).

Download the Lab Code-Base Now
Download the Exercise Sheets Now
  • This is a PDF containing the three exercise sheets.