Microblog retrieval challenges and opportunities

Autor: Rodriguez Perez, Jesus Alberto
Rok vydání: 2018
Předmět:
Druh dokumentu: Electronic Thesis or Dissertation
Popis: In recent years microblogging services have changed the way we communicate. Microblogs are a reduced version of web-blogs which are characterised by being just a few characters long. In the case of Twitter, messages known as \textit{tweets} are only 140 characters long, and are broadcasted from followees to followers organised as a social network. Microblogs such as tweets, are used to communicate up to the second information about any topic. Traffic updates, natural disaster reports, self-promotion, or product marketing are only a small portion of the type of information we can find across microblogging services. Most importantly, it has become a platform that has democratised the communication channels and empowered people into voicing their opinions. In fact, it is a very well known fact that the use Twitter amongst other social media services tilted the balance in favour of ex-president Obama when he was elected president of the USA in 2012. However, whilst the widespread use of microblogs has undoubtedly changed and shaped our current society, it is still very hard to effectively perform simple searches on such datasets due to the particular morphology of its documents. The limited character count and the ineffectiveness of state of the art retrieval models in producing relevant documents for queries, thus prompted TREC organisers to unite the research community into addressing these issues in 2011 during the first Microblog 2011 Track. This doctoral work is one of such efforts, and its focused on improving the access to microblog documents through ad-hoc searches. The first part of our work individually studies the behaviour of the state of the art retrieval models when utilised for microblog ad-hoc retrieval. First we contribute with the best configurations for each of the models studied. But more importantly, we discover how query term frequency and document length relates to the relevance of microblogs. As a result, we propose a microblog specific retrieval model, namely MBRM, which significantly outperforms the state of the art retrieval models described in this work. Furthermore we define an informativeness hypothesis in order to better understand the relevance of microblogs in terms of the presence of their inherent features or dimensions. We significantly improve the behaviour of a state of the art retrieval model by taking into consideration these dimensions as features into a linear combination re-ranking approach. Additionally we investigate the role that structure plays in determining the relevance of a microblog, by encoding the structure of relevant and non-relevant documents into two separate state machines. We then devise an approach to measure the similarity of an unobserved document towards each of these state machines, to then produce a score which is utilised for ranking. Our evaluation results demonstrate how the structure of microblogs plays a role in further differentiating relevant and non-relevant documents when ranking, by showing significantly improved results over a state of the art baseline. Subsequently we study the query performance prediction (QPP) task in terms of microblog ad-hoc retrieval. QPP represents the prediction of how well a query will be satisfied by a particular retrieval system. We study the performance of predictors in the context of microblogs and propose a number of microblog specific predictors. Finally our experimental evaluation demonstrates how our predictors outperform those in the literature in the microblog context. Finally, we address the ``vocabulary mismatch'' problem by studying the effect of utilising scores produced retrieval models as an ingredient in automatic query expansion (AQE) approaches based on pseudo relevance feedback. To this end we propose alternative approaches which do not rely directly on such scores and demonstrate higher stability when determining the most optimal terms for query expansion. In addition we propose an approach to estimate the quality of a term for query expansion. To this end we employ a classifier to determine whether a prospective query expansion term falls into a low, medium or high value category. The predictions performed by the classifier are then utilised to determine a boosting factor for such terms within an AQE approach. Then we conclude by proving that it is possible to predict the quality of terms by providing statistically enhanced results over an AQE baseline.
Databáze: Networked Digital Library of Theses & Dissertations