Friday 19 October 2012

Comparing Distributed Indexing: To MapReduce or Not?

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the MapReduce programming framework. We describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Comparing Distributed Indexing: To MapReduce or Not?
In Proceedings of LSDS-IR 2009.
Boston, Massachusetts, USA, 2009.
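
For anyone unfamiliar with the idea, the inverted-index example in the original MapReduce paper has the map step emit (term, document ID) pairs and the reduce step collect the IDs for each term into a sorted posting list. Below is a minimal single-process Python sketch of that pattern. It is only an illustration, not the implementation evaluated in the paper: the function names, the whitespace tokenisation, and the in-memory grouping that stands in for the shuffle phase are all my own assumptions.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    # Assumption: lower-casing plus whitespace splitting stands in for a real tokeniser.
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce step: merge the document IDs seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(corpus):
    """Run map, then group by term (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


# Hypothetical two-document corpus, keyed by document ID.
corpus = {1: "terrier indexes documents", 2: "hadoop indexes large corpora"}
index = build_index(corpus)
```

In a real Hadoop job the grouping would be done by the framework between the map and reduce tasks, and the posting lists would be written to disk rather than held in a dictionary.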


PDF


Bibtex


On Single-Pass Indexing with MapReduce


Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
On Single-Pass Indexing with MapReduce.
In Proceedings of SIGIR 2009.
Boston, Massachusetts, USA, 2009.

PDF

Bibtex




Tuesday 16 October 2012

Terrier


I am also a current developer of the Terrier Information Retrieval Platform. The current version of Terrier is 3.5. Terrier can be downloaded from here.


CROSS Streams

I currently hold a research assistant position at the University of Glasgow as part of the CROSS Streams EPSRC project, investigating real-time event detection using parallel content streams on top of the distributed stream processing platform Storm.

Homepage

Grant

Welcome

As I have now submitted my PhD and am no longer officially a student, it is time for a new-look website. Pretty, is it not? God knows how I am going to get this template to display my publications though.