Tuesday, 20 November 2012

Evaluating Real-Time Search over Tweets

Twitter offers a phenomenal platform for the social sharing of information. We describe new resources that have been created in the context of the Text REtrieval Conference (TREC) to support the academic study of Twitter as a real-time information source. We formalize an information seeking task—real-time search—and offer a methodology for measuring system effectiveness. At the TREC 2011 Microblog Track, 58 research groups participated in the first ever evaluation of this task. We present data from the effort to illustrate and support our methodology.

Ian Soboroff, Dean McCullough, Jimmy Lin, Craig Macdonald, Iadh Ounis and Richard McCreadie.
Evaluating Real-Time Search over Tweets.
In Proceedings of ICWSM 2012.

PDF
Bibtex


CrowdTerrier: Automatic Crowdsourced Relevance Assessments with Terrier

Information retrieval (IR) systems rely on document relevance assessments for queries to gauge their effectiveness for a variety of tasks, e.g. Web result ranking. Evaluation forums such as TREC and CLEF provide relevance assessments for common tasks. However, it is not possible for such venues to cover all of the collections and tasks currently investigated in IR. Hence, it falls to the individual researchers to generate the relevance assessments for new tasks and/or collections. Moreover, relevance assessment generation can be a time-consuming, difficult and potentially costly process. Recently, crowdsourcing has been shown to be a fast and cheap method to generate relevance assessments in a semi-automatic manner [1]. In this case, the relevance assessment task is outsourced to a large group of non-expert workers, where workers are rewarded via micro-payments....

Richard McCreadie, Craig Macdonald and Iadh Ounis.
CrowdTerrier: Automatic Crowdsourced Relevance Assessments with Terrier.
In Proceedings of SIGIR 2012.

PDF
Bibtex


On Building a Reusable Twitter Corpus

The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In this paper, we detail a new methodology for legally building and distributing Twitter corpora, developed through collaboration between the Text REtrieval Conference (TREC) and Twitter. In particular, we detail how the first publicly available Twitter corpus – referred to as Tweets2011 – was distributed via lists of tweet identifiers and specialist tweet crawling software. Furthermore, we analyse whether this distribution approach remains robust over time, as tweets in the corpus are removed either by users or Twitter itself. Tweets2011 was successfully used by 58 participating groups for the TREC 2011 Microblog track, while our results attest to the robustness of the crawling methodology over time.

Richard McCreadie, Ian Soboroff, Jimmy Lin, Craig Macdonald, Iadh Ounis, and Dean McCullough.
On Building a Reusable Twitter Corpus.
In Proceedings of SIGIR 2012.


PDF
Bibtex

University of Glasgow at TREC 2011: Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks

In TREC 2011, we focus on tackling the new challenges proposed by the pilot Crowdsourcing and Microblog tracks, using our Terrier Information Retrieval Platform. Meanwhile, we continue to build upon our novel xQuAD framework and data-driven ranking approaches within Terrier to achieve effective and efficient ranking for the TREC Web track. In particular, the aim of our Microblog track participation is the development of a learning to rank approach for filtering within a tweet ranking environment, where tweets are ranked in reverse chronological order. In the Crowdsourcing track, we work to achieve a closer integration between the crowdsourcing marketplaces that are used for relevance assessment, and Terrier, which produces the document rankings to be assessed. Moreover, we focus on generating relevance assessments quickly and at a minimal cost. For the Web track, we enhance the data-driven learning support within Terrier by proposing a novel framework for the fast computation of document features for learning to rank.

Richard McCreadie, Craig Macdonald, Rodrygo Santos and Iadh Ounis.
University of Glasgow at TREC 2011: Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks.
In Proceedings of TREC 2011.

PDF
Bibtex



Identifying Top News using Crowdsourcing

The influential Text REtrieval Conference (TREC) has always relied upon specialist assessors or, occasionally, participating groups to create relevance judgements for the tracks that it runs. Recently, however, crowdsourcing has been championed as a cheap, fast and effective alternative to traditional TREC-like assessments. In 2010, TREC tracks experimented with crowdsourcing for the very first time. In this paper, we report our successful experience in creating relevance assessments for the TREC Blog track 2010 top news stories task using crowdsourcing. In particular, we crowdsourced both real-time newsworthiness assessments for news stories and traditional relevance assessments for blog posts. We conclude that crowdsourcing appears to be not only a feasible, but also a cheap and fast means to generate relevance assessments. Furthermore, we detail our experiences running the crowdsourced evaluation of the TREC Blog track, discuss the lessons learned, and provide best practices.

Richard McCreadie, Craig Macdonald and Iadh Ounis.
Identifying Top News using Crowdsourcing.
Information Retrieval Journal, 2012.

Springer Link
Bibtex



A Learned Approach for Ranking News in Real-time using the Blogosphere

Newspaper websites and news aggregators rank news stories by their newsworthiness in real-time for display to the user. Recent work has shown that news stories can be ranked automatically in a retrospective manner based upon related discussion within the blogosphere. However, it is as yet undetermined whether blogs are sufficiently fresh to rank stories in real-time. In this paper, we propose a novel learning to rank framework which leverages current blog posts to rank news stories in a real-time manner. We evaluate our proposed learning framework within the context of the TREC Blog track top stories identification task. Our results show that, indeed, the blogosphere can be leveraged for the real-time ranking of news, including for unpredictable events. Our approach improves upon state-of-the-art story ranking approaches, outperforming both the best TREC 2009/2010 systems and its single best performing feature.

Richard McCreadie, Craig Macdonald and Iadh Ounis.
A Learned Approach for Ranking News in Real-time using the Blogosphere.
In Proceedings of SPIRE 2011.

PDF
Bibtex



University of Glasgow at TREC 2010: Experiments with Terrier in Blog and Web Tracks

In TREC 2010, we continue to build upon the Voting Model and experiment with our novel xQuAD framework under the auspices of the Terrier IR Platform. In particular, our focus is the development of novel applications for data-driven learning in the Blog and Web tracks, with experimentation spanning hundreds of features. In the Blog track, we propose novel feature sets for the ranking of blogs, news stories and blog posts. In the Web track, we propose novel selective approaches for ad hoc and diversity search.

Rodrygo Santos, Richard McCreadie, Craig Macdonald and Iadh Ounis.
University of Glasgow at TREC 2010: Experiments with Terrier in Blog and Web Tracks.
In Proceedings of TREC 2010.

PDF
Bibtex



MapReduce indexing strategies: Studying scalability and efficiency

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
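The per-posting list idea can be illustrated with a minimal sketch: the map phase emits small posting-list fragments per term, and the reduce phase concatenates them into full posting lists. All names below are illustrative, and the in-memory "shuffle" merely stands in for Hadoop's grouping step; this is not Terrier's actual implementation.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text):
    # Map phase: emit one (term, [(doc_id, tf)]) pair per unique term,
    # so each emitted value is already a small posting-list fragment.
    for term, tf in Counter(text.lower().split()).items():
        yield term, [(doc_id, tf)]

def reduce_postings(term, fragments):
    # Reduce phase: concatenate the fragments into the term's full posting list.
    postings = [p for fragment in fragments for p in fragment]
    return term, sorted(postings)

def index(corpus):
    # Shuffle stand-in: group map output by term, then reduce each group.
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, fragment in map_document(doc_id, text):
            grouped[term].append(fragment)
    return dict(reduce_postings(t, f) for t, f in grouped.items())

corpus = {1: "terrier indexes tweets", 2: "tweets about terrier"}
print(index(corpus)["terrier"])  # [(1, 1), (2, 1)]
```

Because each map value is a list rather than a single posting, less per-record overhead crosses the network, which is the data-transfer saving the results above point to.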

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
MapReduce indexing strategies: Studying scalability and efficiency.
Information Processing and Management, Special Issue on Large Scale Distributed Systems.
DOI: 10.1016/j.ipm.2010.12.003

PDF
Bibtex



Crowdsourcing Blog Track Top News Judgments at TREC

Since its inception, the venerable TREC retrieval conference has relied upon specialist assessors or participating groups to create relevance judgments for the tracks that it runs. However, recently crowdsourcing has been proposed as a possible alternative to traditional TREC-like assessments, supporting fast accumulation of judgments at a low cost. 2010 was the first year that TREC experimented with crowdsourcing. In this paper, we report our successful experience in creating relevance assessments for the TREC Blog track 2010 top news stories task. We conclude that crowdsourcing is an effective alternative to using specialist assessors or participating groups for this task.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Crowdsourcing Blog Track Top News Judgments at TREC.
In Proceedings of CSDM 2010.
Hong Kong, China, 2011.

PDF
Bibtex


Crowdsourcing a News Query Classification Dataset

Web search engines are well known for aggregating news vertical content into their result rankings in response to queries classified as news-related. However, no dataset currently exists upon which approaches to news query classification can be evaluated and compared. This paper studies the generation and validation of a news query classification dataset comprising labels crowdsourced from Amazon’s Mechanical Turk and details insights gained. Notably, our study focuses on two challenges when crowdsourcing news query classification labels: 1) how to overcome our workers’ lack of information about the news stories from the time of each query and 2) how to ensure the resulting labels are of high enough quality to make the dataset useful. We empirically show that a worker’s lack of information about news stories can be addressed through the integration of news-related content into the labelling interface and that this improves the quality of the resulting labels. Overall, we find that crowdsourcing is suitable for building a news query classification dataset.
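One common quality-control baseline for redundant crowdsourced labels (not necessarily the validation used in the paper) is majority voting with an agreement score. A minimal sketch, with hypothetical queries and labels:

```python
from collections import Counter

def aggregate_labels(worker_labels):
    # Majority vote over redundant worker labels per query; also report
    # the agreement ratio as a simple label-quality signal.
    gold = {}
    for query, labels in worker_labels.items():
        (label, votes), = Counter(labels).most_common(1)
        gold[query] = (label, votes / len(labels))
    return gold

labels = {
    "obama speech": ["news", "news", "not-news"],
    "python tutorial": ["not-news", "not-news", "not-news"],
}
print(aggregate_labels(labels)["obama speech"])  # majority label 'news', 2/3 agreement
```

Queries with low agreement can then be relabelled or discarded before the dataset is released.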

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Crowdsourcing a News Query Classification Dataset.
In Proceedings of CSE 2010.
Geneva, Switzerland, 2010.

PDF
Bibtex


Leveraging User-Generated Content for News Search

Over the last few years both availability and accessibility of current news stories on the Web have dramatically improved [3]. In particular, users can now access news from a variety of sources hosted on the Web, from newswire presences such as the New York Times, to integrated news search within Web search engines. However, of central interest is the emerging impact that user-generated content (UGC) is having on this online news landscape. Indeed, the emergence of Web 2.0 has turned a static news consumer base into a dynamic news machine, where news stories are summarised and commented upon. In summary, value is being added to each news story in terms of additional content. ...

Richard McCreadie.
Leveraging User-Generated Content for News Search.
In Proceedings of SIGIR 2010.
Geneva, Switzerland, 2010.

PDF
Bibtex


News Article Ranking: Leveraging the Wisdom of Bloggers

Every day, editors rank news articles for placement within their newspapers. In this paper, we investigate how news article ranking can be performed automatically. In particular, we investigate the blogosphere as a prime source of evidence, on the intuition that bloggers, and by extension their blog posts, can indicate interest in one news article or another. Moreover, we propose to model this automatic news article ranking task as a voting process, where each relevant blog post acts as a vote for one or more news articles. We evaluate this approach using the TREC 2009 Blog track top news story identification task judgments, showing strong performance in comparison to TREC systems, as well as two alternative baseline rankings. Furthermore, to increase the accuracy of the proposed approach, we examine temporal re-ranking techniques, corpus cleaning of inappropriate articles and article expansion to counter vocabulary mismatch. We conclude that, overall, blog post evidence can be a useful indicator to a news editor as to the importance of various news stories, and that our approaches for extracting this evidence are effective.
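The voting intuition can be sketched simply: each blog post retrieved for a query casts one vote for every news article it discusses, and articles are ranked by total votes. This is a simplified, unweighted variant for illustration (the Voting Model also supports weighting votes by the post's retrieval score), and all names and data below are hypothetical.

```python
from collections import Counter

def rank_articles(relevant_posts, links_to_articles):
    # Each relevant blog post casts one vote for every news article
    # it references; articles are ranked by descending vote count.
    votes = Counter()
    for post in relevant_posts:
        votes.update(links_to_articles.get(post, []))
    return [article for article, _ in votes.most_common()]

links = {
    "post1": ["election"],
    "post2": ["election", "sports"],
    "post3": ["sports"],
    "post4": ["election"],
}
print(rank_articles(["post1", "post2", "post4"], links))  # ['election', 'sports']
```

Temporal re-ranking or article expansion would then adjust these raw counts, as explored in the paper.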

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
News Article Ranking: Leveraging the Wisdom of Bloggers.
In Proceedings of RIAO 2010.
Paris, France, 2010.

PDF
Bibtex


Insights of the Horizons of News Search

In recent years, news reporting and consumption have made the profound shift from paper-based media to free online publications, while the simultaneous emergence of Web 2.0 has fundamentally changed the way we react to news. In this paper, we argue that the rapid increase in the volume of user-generated content now available presents new and exciting opportunities for the advancement of news search. In particular, we discuss new applications for user-generated content when determining the stories of the moment, as well as adding value to the results returned.

Richard McCreadie, Craig Macdonald, and Iadh Ounis. 
Insights of the Horizons of News Search.
In Proceedings of SSM 2010.
New York, NY, USA, 2010.

PDF
Bibtex


Friday, 19 October 2012

Comparing Distributed Indexing: To MapReduce or Not?

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular under the auspices of the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Comparing Distributed Indexing: To MapReduce or Not?
In Proceedings of LSDS-IR 2009.
Boston, Massachusetts, USA, 2009.


PDF
Bibtex


On Single-Pass Indexing with MapReduce


Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.
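The single-pass idea, namely building postings in memory and flushing a partial index ("run") whenever a memory budget is exhausted, then merging the runs, can be sketched as follows. This is a sequential illustration only (the paper parallelises the work via MapReduce), and the budget here counts documents purely for simplicity.

```python
def single_pass_index(docs, memory_budget=2):
    # Single-pass indexing sketch: accumulate postings in memory,
    # flushing a partial index (a "run") every `memory_budget` documents.
    runs, postings = [], {}
    for n, (doc_id, text) in enumerate(docs.items(), 1):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
        if n % memory_budget == 0:
            runs.append(postings)
            postings = {}
    if postings:
        runs.append(postings)
    # Merge the runs into the final index; this merge is the step a
    # distributed implementation spreads across reduce tasks.
    final = {}
    for run in runs:
        for term, plist in run.items():
            final.setdefault(term, []).extend(plist)
    return final

print(single_pass_index({1: "a b", 2: "b c", 3: "c d"})["c"])  # [2, 3]
```

Because documents are processed in order, posting lists come out docid-sorted without an explicit sort, which is what keeps the approach single-pass.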

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
On Single-Pass Indexing with MapReduce.
In Proceedings of SIGIR 2009.
Boston, Massachusetts, USA, 2009.

PDF
Bibtex




Tuesday, 16 October 2012

Terrier


I am also one of the current developers of the Terrier Information Retrieval Platform. The current version of Terrier is 3.5. Terrier can be downloaded from here.


CROSS Streams

I currently hold a research assistant position at the University of Glasgow as part of the CROSS Streams EPSRC project, investigating real-time event detection using parallel content streams on top of the distributed stream processing platform Storm.

Homepage
Grant

Welcome

As I have now submitted my PhD and am no longer officially a student, it is time for a new-look website. Pretty, is it not? God knows how I am going to get this template to display my publications though.