Popular microblogging services, such as Twitter, engage millions of users who constantly post and share information about news and current events, resulting in millions of messages each day discussing what is happening in the world. To help users obtain an overview of microblog content relating to topics and events that they are interested in, classical summarisation techniques from the newswire domain have been successfully applied and extended for use on microblogs. However, much of the current literature on microblog summarisation assumes that the summarisation evaluation measures that have been shown to be effective on newswire are still appropriate for evaluating microblog summarisation. Hence, in this paper, we aim to determine whether the traditional automatic newswire summarisation evaluation metrics generalise to the task of microblog summarisation. In particular, using three microblog summarisation datasets, we determine a ranking of summarisation systems under three automatic summarisation evaluation metrics from the literature. We then compare and contrast the ranking of systems produced under each metric to system rankings produced through a qualitative user evaluation, with the aim of determining which metric best simulates human summarisation preferences. Our results indicate that the automatic evaluation metrics we investigate do not always concur with each other. Further, we find that Fraction of Topic Words better agrees with what users tell us about the quality and effectiveness of microblog summaries than the ROUGE-1 measure that is most commonly reported in the literature.
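To illustrate the kind of comparison the abstract describes, the sketch below (a minimal, illustrative Python example, not code from the paper) ranks a set of hypothetical summarisation systems by an automatic metric score and by mean user ratings, then measures the agreement between the two orderings with Kendall's tau. The system names and scores are invented, the choice of Kendall's tau is an assumption (the paper may use a different agreement statistic), and the example assumes scipy is available.

```python
# Minimal sketch (not from the paper): compare the system ranking induced by an
# automatic metric with the ranking induced by user preference judgements.
from scipy.stats import kendalltau

# Hypothetical per-system scores: mean automatic-metric score and mean user rating.
metric_scores = {"sysA": 0.42, "sysB": 0.35, "sysC": 0.51, "sysD": 0.28}
user_ratings  = {"sysA": 3.9,  "sysB": 3.1,  "sysC": 4.2,  "sysD": 3.4}

systems = sorted(metric_scores)                    # fixed system order
metric_vals = [metric_scores[s] for s in systems]  # scores under the metric
user_vals = [user_ratings[s] for s in systems]     # scores from the user study

# Kendall's tau over the two score vectors: +1 means the metric orders the
# systems exactly as users do, -1 means the orderings are reversed.
tau, p_value = kendalltau(metric_vals, user_vals)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```

Under this view, the metric that yields the highest agreement with the user-derived ranking would be preferred for evaluating microblog summarisation.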
Stuart Mackie, Richard McCreadie, Craig Macdonald and Iadh Ounis.
On Choosing an Effective Automatic Evaluation Metric for Microblog Summarisation.
In Proceedings of IIIX 2014.