Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.
Richard McCreadie, Craig Macdonald, and Iadh Ounis.
On Single-Pass Indexing with MapReduce.
In Proceedings of SIGIR 2009.
Boston, Massachusetts, USA, 2009.
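
To make the idea of expressing indexing in MapReduce terms concrete, here is a minimal, generic sketch of inverted-index construction as a map step and a reduce step. It is not the paper's single-pass algorithm and it does not use Hadoop: the shuffle phase is simulated in one process, tokenisation is plain whitespace splitting, and the names map_document, reduce_postings, and build_index are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = defaultdict(int)
    for token in text.lower().split():  # assumption: naive whitespace tokenisation
        counts[token] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def reduce_postings(term, postings):
    """Reduce step: merge all postings for one term into a list sorted by doc_id."""
    return term, sorted(postings)


def build_index(corpus):
    """Simulate the shuffle locally by grouping map output by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, posting in map_document(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_postings(term, plist) for term, plist in grouped.items())


# Toy corpus, invented for this example only.
corpus = {
    1: "map reduce indexing",
    2: "indexing with map reduce and reduce",
}
index = build_index(corpus)
```

In this sketch each map call touches only one document and each reduce call touches only one term, which is the property that lets a framework such as Hadoop spread the work across machines.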