Crowdsourcing a News Query Classification Dataset

Tuesday, 20 November 2012

Posted by Dr. Richard McCreadie at 06:21

Web search engines are well known for aggregating news vertical content into their result rankings in response to queries classified as news-related. However, no dataset currently exists upon which approaches to news query classification can be evaluated and compared. This paper studies the generation and validation of a news query classification dataset composed of labels crowdsourced from Amazon’s Mechanical Turk, and details the insights gained. Notably, our study focuses on two challenges when crowdsourcing news query classification labels: 1) how to overcome our workers’ lack of information about the news stories from the time of each query, and 2) how to ensure the resulting labels are of high enough quality to make the dataset useful. We empirically show that a worker’s lack of information about news stories can be addressed by integrating news-related content into the labelling interface, and that this improves the quality of the resulting labels. Overall, we find that crowdsourcing is suitable for building a news query classification dataset.

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Crowdsourcing a News Query Classification Dataset.
In Proceedings of CSE 2010.
Geneva, Switzerland, 2010.