Richard McCreadie

Friday, 19 October 2012

Comparing Distributed Indexing: To MapReduce or Not?

Posted by Dr. Richard McCreadie at 08:12 0 comments

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular under the auspices of the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Comparing Distributed Indexing: To MapReduce or Not?
In Proceedings of LSDS-IR 2009.
Boston, Massachusetts, USA, 2009.

PDF

Bibtex
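
For readers unfamiliar with the idea, here is a minimal sketch of the simplest kind of MapReduce indexing, where the map step emits one (term, document) pair per token and the reduce step turns each term's pairs into a posting list. It is a toy, in-memory illustration in plain Python, not the code evaluated in the paper and not Hadoop; the function names and the whitespace tokeniser are assumptions made for the example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token (hypothetical tokeniser)."""
    for token in text.lower().split():
        yield token, doc_id


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group emitted values by term."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_term(term, doc_ids):
    """Reduce step: count occurrences per document and sort by doc_id."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())


def build_index(corpus):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_term(term, ids) for term, ids in shuffle(pairs).items())


if __name__ == "__main__":
    print(build_index({1: "a b a", 2: "b c"}))
```

Emitting a pair for every token keeps the map step trivial, but it pushes a lot of data through the shuffle, which is one reason more refined strategies are worth comparing.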

On Single-Pass Indexing with MapReduce

Posted by Dr. Richard McCreadie at 07:42 0 comments

Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
On Single-Pass Indexing with MapReduce.
In Proceedings of SIGIR 2009.
Boston, Massachusetts, USA, 2009.

PDF

Bibtex
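
As a rough illustration of what single-pass indexing can look like in MapReduce terms, the sketch below has each map task build partial posting lists in memory and flush them when a budget is reached, while the reduce step merges the partial lists for each term in flush order. This is a simplified, in-memory approximation written for this post, not the implementation from the paper; the `max_postings` budget, the function names and the whitespace tokeniser are all assumptions.

```python
from collections import defaultdict


def _flush(memory, split_id, flush_no):
    """Emit every in-memory partial posting list, tagged with its origin."""
    for term in sorted(memory):
        yield term, (split_id, flush_no, sorted(memory[term].items()))


def map_split(split_id, docs, max_postings=1000):
    """Map step: index (doc_id, text) pairs in memory, flushing when full.

    max_postings is a hypothetical stand-in for a real memory limit.
    Flushes happen only at document boundaries.
    """
    memory = defaultdict(dict)
    held = 0
    flush_no = 0
    for doc_id, text in docs:
        for token in text.lower().split():
            if doc_id not in memory[token]:
                memory[token][doc_id] = 0
                held += 1
            memory[token][doc_id] += 1
        if held >= max_postings:
            yield from _flush(memory, split_id, flush_no)
            memory.clear()
            held = 0
            flush_no += 1
    if memory:
        yield from _flush(memory, split_id, flush_no)


def reduce_term(partials):
    """Reduce step: concatenate partial lists ordered by (split, flush)."""
    merged = []
    for _, _, postings in sorted(partials, key=lambda p: (p[0], p[1])):
        merged.extend(postings)
    return merged


def build_index_single_pass(splits, max_postings=1000):
    """Simulate the job: one map task per split, then one reduce per term."""
    grouped = defaultdict(list)
    for split_id, docs in enumerate(splits):
        for term, partial in map_split(split_id, docs, max_postings):
            grouped[term].append(partial)
    return {term: reduce_term(parts) for term, parts in grouped.items()}


if __name__ == "__main__":
    docs = [(1, "a b a"), (2, "b c"), (3, "a c")]
    print(build_index_single_pass([docs], max_postings=2))
```

Because the sketch only flushes between documents, the reducer can simply concatenate partial lists; the final lists come out sorted by document id as long as ids increase across splits and flushes.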

Tuesday, 16 October 2012

Terrier

Posted by Dr. Richard McCreadie at 07:24 0 comments

I am also a current developer of the Terrier Information Retrieval Platform. The current version of Terrier is 3.5. Terrier can be downloaded from here.

CROSS Streams

Posted by Dr. Richard McCreadie at 07:15 0 comments

I currently hold a research assistant position at the University of Glasgow as part of the CROSS Streams EPSRC project, investigating real-time event detection using parallel content streams on top of the distributed stream processing platform Storm. Homepage Grant

Welcome

Posted by Dr. Richard McCreadie at 07:10 0 comments

As I have now submitted my PhD and am no longer officially a student, it is time for a new-look website. Pretty, is it not? God knows how I am going to get this template to display my publications though.
