Friday, 19 October 2012

Comparing Distributed Indexing: To MapReduce or Not?

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce...

On Single-Pass Indexing with MapReduce

Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed...

Tuesday, 16 October 2012


I am also a current developer of the Terrier Information Retrieval Platform. The current version of Terrier is 3.5. Terrier can be downloaded from her...

CROSS Streams

I currently hold a research assistant position at the University of Glasgow as part of the CROSS Streams EPSRC project, investigating real-time event detection using parallel content streams on top of the distributed stream processing platform Storm. Homepage Gr...


As I have now submitted my PhD and am no longer officially a student, it is time for a new-look website. Pretty, is it not? God knows how I am going to get this template to display my publications thou...