New Perspectives to Query Performance Prediction Evaluation

Author: Oleg Zendel
Year of publication: 2021
Source: SIGIR
DOI: 10.1145/3404835.3463270
Description: Research on Query Performance Prediction (QPP) focuses on estimating the effectiveness of retrieval results in the absence of human relevance judgments. Accurately estimating the effectiveness of a search performed in response to a query has been studied extensively over the past two decades. With the rising popularity of virtual assistants and evolving research on complex information needs, both the need for reliable QPP methods and the number of potential applications increase significantly. In this work we focus on improving the evaluation framework of QPP. Since we see the existing evaluation as a considerable limitation on the improvement of QPP methods, a reliable and improved evaluation framework would constitute a stepping stone toward a breakthrough in QPP.

The existing evaluation framework in QPP mainly relies on measuring the correlation coefficient between the per-query prediction scores and the actual per-query system effectiveness, usually Average Precision (AP). The QPP method that achieves the higher correlation is considered superior. However, Hauff et al. demonstrate that higher correlation does not guarantee more accurate prediction. The authors additionally advocate the use of Fisher's transformation and Confidence Intervals (CIs) to determine statistically significant differences between multiple correlation coefficients. Furthermore, the existing evaluation methodology holds only for a specific combination of corpus, retrieval method, and set of queries, and does not necessarily transfer if any of these is changed. That is, the existing evaluation is not agnostic to the different components, so any conclusions about the relative prediction quality of QPP methods should be taken with a grain of salt.

In the proposed research we aim to develop a better evaluation technique to reliably compare the performance of QPP methods. We intend to develop a new evaluation framework and standards that simultaneously enable the utilization of query variants and take other confounding factors in QPP evaluation into consideration. Specifically, we raise the following research questions: (i) What limitations exist in the current evaluation practices of QPP? (ii) What are the best approaches to perform detailed failure analysis of query performance predictor results? (iii) How do existing QPP methods differ in performance on a set of topics (distinct information needs) represented by a single query versus a set of multiple queries that represent the same information need? (iv) How do the existing and new evaluation methodologies align with user satisfaction?

To answer the first two research questions, Faggioli et al. proposed a new evaluation framework for QPP in which an error is calculated for each query, yielding a distribution of per-query errors over a set of queries. This distribution of errors enables the authors to apply an N-way ANalysis Of VAriance (ANOVA) followed by a post-hoc analysis, Tukey's Honestly Significant Difference (HSD) test, to determine statistically significant differences between the multiple factors involved in QPP evaluation. Separating the different components in the evaluation process allows more reliable conclusions to be reached regarding the effect of each component on the prediction process.
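To make the standard practice concrete, here is a minimal Python sketch (not code from the paper) of the correlation-based comparison together with the Fisher-transformation confidence intervals that Hauff et al. advocate; the per-query AP values and prediction scores are synthetic placeholders.

```python
# Minimal sketch of correlation-based QPP evaluation with Fisher's
# transformation CIs. All numbers below are illustrative placeholders.
import numpy as np
from scipy import stats

ap = np.array([0.31, 0.12, 0.55, 0.08, 0.47, 0.23, 0.66, 0.19])      # per-query AP
pred_a = np.array([0.42, 0.20, 0.61, 0.15, 0.50, 0.33, 0.70, 0.28])  # predictor A scores
pred_b = np.array([0.25, 0.30, 0.58, 0.22, 0.41, 0.29, 0.52, 0.35])  # predictor B scores

# Pearson correlation between prediction scores and AP
# (Kendall's tau is also common in QPP evaluation).
r_a, _ = stats.pearsonr(pred_a, ap)
r_b, _ = stats.pearsonr(pred_b, ap)

def fisher_ci(r, n, alpha=0.05):
    """CI for a Pearson correlation via Fisher's z transformation."""
    z = np.arctanh(r)                 # Fisher transformation
    se = 1.0 / np.sqrt(n - 3)         # standard error of z
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)   # back-transform to the r scale

n = len(ap)
for name, r in [("A", r_a), ("B", r_b)]:
    lo, hi = fisher_ci(r, n)
    print(f"predictor {name}: r={r:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Overlapping CIs on such a small query set illustrate the point above: a nominally higher correlation alone does not establish that one predictor is better.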
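The framework of Faggioli et al. can likewise be sketched as follows. This is an illustrative approximation only: the per-query error used here is a generic absolute error on synthetic data rather than the paper's own error measure, the QPP method names serve merely as group labels, and the factor design is reduced to two factors.

```python
# Sketch of the per-query-error framework: build a distribution of
# per-query errors, fit an ANOVA over the factors of interest, then
# run Tukey's HSD post hoc. Data are synthetic and for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
queries = [f"q{i}" for i in range(30)]
rows = []
for predictor in ["clarity", "nqc", "wig"]:  # labels only, not the paper's setup
    for q in queries:
        # per-query prediction error; synthetic stand-in values
        rows.append({"query": q, "predictor": predictor,
                     "error": np.abs(rng.normal(loc=0.3, scale=0.1))})
df = pd.DataFrame(rows)

# Two-way ANOVA with predictor and query (topic) as factors.
model = smf.ols("error ~ C(predictor) + C(query)", data=df).fit()
print(anova_lm(model, typ=2))

# Tukey's HSD: which predictors differ significantly in mean per-query error?
print(pairwise_tukeyhsd(df["error"], df["predictor"], alpha=0.05))
```

Pairing ANOVA with Tukey's HSD matters here: the post-hoc test controls the family-wise error rate across all pairwise predictor comparisons, which repeated t-tests would not.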
As a preliminary study, Zendel et al. compared multiple existing QPP methods on the aforementioned tasks: predicting effectiveness for different queries representing different topics, and for different query variants that represent the same topic. They found that the difference in AP between the queries is an important confounding factor that affects prediction quality. Future work will focus on developing a reliable evaluation framework for QPP, both for queries from different topics and for query variants of the same topic. A suitable framework should enable rigorous statistical analysis with decomposition and quantification of the different factors that affect QPP. In addition, a subsequent user study will explore how the new evaluation framework aligns with user satisfaction with QPP results.
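The distinction in research question (iii) between predicting across topics and predicting across variants of one topic can be illustrated with the following hedged sketch; the topics, variants, AP values, and scores are all synthetic, and the construction is not Zendel et al.'s experimental code.

```python
# Contrast inter-topic correlation (one query per topic, across topics)
# with intra-topic correlation (across query variants of the same topic).
# All data are synthetic and illustrative only.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
records = []
for topic in range(10):
    base_ap = rng.uniform(0.1, 0.6)          # topic-level difficulty
    for variant in range(5):                  # query variants of the topic
        ap = float(np.clip(base_ap + rng.normal(0, 0.08), 0, 1))
        records.append({"topic": topic, "variant": variant,
                        "ap": ap, "score": ap + rng.normal(0, 0.1)})
df = pd.DataFrame(records)

# Inter-topic: one variant per topic, correlated across topics.
first = df[df["variant"] == 0]
tau_inter, _ = stats.kendalltau(first["score"], first["ap"])

# Intra-topic: correlation across variants within each topic, averaged.
taus = [stats.kendalltau(g["score"], g["ap"])[0]
        for _, g in df.groupby("topic")]
print(f"inter-topic tau={tau_inter:.3f}, mean intra-topic tau={np.mean(taus):.3f}")
```

Because AP varies far less within a topic than between topics, intra-topic correlations are typically harder to achieve, which is one way the AP spread acts as a confounder.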
Database: OpenAIRE