The thesis investigates how content produced by end-users on the World Wide Web --- referred to as user-generated content --- can enhance the news vertical aspect of a universal Web search engine, such that news-related queries can be satisfied more accurately, comprehensively and in a more timely manner. We propose a news search framework to describe the news vertical aspect of a universal web search engine. This framework is comprised of four components, each providing a different piece of functionality. The Top Events Identification component identifies the most important events that are happening at any given moment using discussion in user-generated content streams. The News Query Classification component classifies incoming queries as news-related or not in real-time. The Ranking News-Related Content component finds and ranks relevant content for news-related user queries from multiple streams of news and user-generated content. Finally, the News-Related Content Integration component merges the previously ranked content for the user query into the Web search ranking. In this thesis, we argue that user-generated content can be leveraged in one or more of these components to better satisfy news-related user queries. Potential enhancements include the faster identification of news queries relating to breaking news events, more accurate classification of news-related queries, increased coverage of the events searched for by the user or increased freshness in the results returned.
Approaches to tackle each of the four components of the news search framework are proposed, which aim to leverage user-generated content. Together, these approaches form the news vertical component of a universal Web search engine. Each approach proposed for a component is thoroughly evaluated using one or more datasets developed for that component. Conclusions are derived concerning whether the use of user-generated content enhances the component in question using an appropriate measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the Top Events Identification component; classification accuracy over different types of query for the News Query Classification component; relevance of the documents returned for the Ranking News-Related Content component; and end-user preference for rankings integrating user-generated content in comparison to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches and insights into their behaviour are also discussed.
In particular, the evaluation of the Top Events Identification component examines how effectively events --- represented by newswire articles --- can be ranked by their importance using two different streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed approaches for this component indicates that blog posts are an effective source of evidence to use when ranking events and that these approaches achieve state-of-the-art effectiveness. Using the same approaches instead driven by a stream of tweets, provide a story ranking performance that is significantly more effective than random, but is not consistent across all of the datasets and approaches tested. Insights are provided into the reasons for this with regard to the transient nature of discussion in Twitter.
Through the evaluation of the News Query Classification component, we show that the use of timely features extracted from different news and user-generated content sources can increase the accuracy of news query classification over relying upon newswire provider streams alone. Evidence also suggests that the usefulness of the user-generated content sources varies as news events mature, with some sources becoming more influential over time as new content is published, leading to an upward trend in classification accuracy.
The Ranking News-Related Content component evaluation investigates how to effectively rank content from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show that learning to rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking effectiveness under real-time constraints.
Finally this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated content for news-related queries to rankings containing only Web search results or integrated with only newswire articles. Of the user-generated content sources tested, the most popular source is shown to be Twitter, particularly for queries relating to breaking events.
The central contributions of this thesis are the introduction of a news search framework, the approaches to tackle each of the four components of the framework that integrate user-generated content and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad range of experiments spanning the entire search process for news-related queries. The experiments reported in this thesis demonstrate the potential and scope for enhancements that can be brought about by the leverage of user-generated content for real-time news search and related applications.
News Vertical Search using User-Generated Content
Richard McCreadie
University of Glasgow, 2012
PDF
Bibtex