Contrasting Human Opinion of Non-factoid Question Answering with Automatic Evaluation

Authors: Tianbo Ji, Gareth J. F. Jones, Yvette Graham
Year of publication: 2020
Source: CHIIR
Description: Evaluation of non-factoid question answering systems generally takes the form of computing automatic metric scores on a sample test set of questions, comparing system output against human-generated reference answers. Conclusions drawn from the scores produced by automatic metrics inevitably lead to important decisions about future research directions. Commonly applied metrics include ROUGE, adopted from the related field of summarization, as well as BLEU and Meteor, both originally developed for evaluation of machine translation. In this paper, we pose an important question: given that question answering is evaluated with automatic metrics originally designed for other tasks, to what degree do the conclusions drawn from such metrics correspond to human opinion of system-generated answers? We take the task of machine reading comprehension (MRC) as a case study and, to address this question, provide a new method of human evaluation developed specifically for the task at hand.
Database: OpenAIRE
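
As a concrete illustration of the metric-based evaluation the paper scrutinizes, the sketch below scores one system-generated answer against a human reference answer with ROUGE-L, BLEU, and METEOR. This is a minimal sketch, not the paper's pipeline: the `rouge-score` and `nltk` packages, the smoothing choice, and the toy answer strings are assumptions made for illustration.

```python
# Minimal sketch: scoring a system answer against a human reference
# with the three metrics named in the abstract. Strings and package
# choices here are illustrative assumptions, not the paper's setup.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer  # pip install rouge-score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet synonym matching

reference = "the treaty was signed in 1951 to pool coal and steel production"
candidate = "the treaty, signed in 1951, pooled coal and steel production"
ref_tokens, cand_tokens = reference.split(), candidate.split()

# ROUGE-L: longest-common-subsequence overlap, adopted from summarization.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", rouge.score(reference, candidate)["rougeL"].fmeasure)

# BLEU: n-gram precision, originally a machine-translation metric;
# smoothing avoids zero scores when short answers lack 4-gram matches.
smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smooth))

# METEOR: unigram matching with stemming and synonymy, also from MT evaluation.
print("METEOR:", meteor_score([ref_tokens], cand_tokens))
```

The paper's central question is whether rankings produced by scores like these actually track human opinion of answer quality; its contribution is a human evaluation method designed specifically for MRC.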