Contrasting Human Opinion of Non-factoid Question Answering with Automatic Evaluation

Authors: Tianbo Ji, Gareth J. F. Jones, Yvette Graham
Year of publication: 2020
Source: CHIIR
Description: Evaluation of non-factoid question answering systems generally takes the form of computing automatic metric scores on a sample test set of questions, comparing system output against human-generated reference answers. Conclusions drawn from the scores produced by automatic metrics inevitably lead to important decisions about future research directions. Commonly applied metrics include ROUGE, adopted from the related field of summarization, as well as BLEU and Meteor, both originally developed for evaluation of machine translation. In this paper, we pose an important question: given that question answering is evaluated with automatic metrics originally designed for other tasks, to what degree do the conclusions drawn from such metrics correspond to human opinion of system-generated answers? We take the task of machine reading comprehension (MRC) as a case study and, to address this question, provide a new method of human evaluation developed specifically for the task at hand.
Database: OpenAIRE
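
As a concrete illustration of the metric-based evaluation the paper scrutinizes, the sketch below scores one system-generated answer against a human reference answer with ROUGE-L, BLEU, and METEOR. This is a minimal sketch, not the paper's pipeline: the `rouge-score` and `nltk` packages, the smoothing choice, and the toy answer strings are assumptions made for illustration.

```python
# Minimal sketch: scoring a system answer against a human reference
# with the three metrics named in the abstract. Strings and package
# choices here are illustrative assumptions, not the paper's setup.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer  # pip install rouge-score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet synonym matching

reference = "the treaty was signed in 1951 to pool coal and steel production"
candidate = "the treaty, signed in 1951, pooled coal and steel production"
ref_tokens, cand_tokens = reference.split(), candidate.split()

# ROUGE-L: longest-common-subsequence overlap, adopted from summarization.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", rouge.score(reference, candidate)["rougeL"].fmeasure)

# BLEU: n-gram precision, originally a machine-translation metric;
# smoothing avoids zero scores when short answers lack 4-gram matches.
smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smooth))

# METEOR: unigram matching with stemming and synonymy, also from MT evaluation.
print("METEOR:", meteor_score([ref_tokens], cand_tokens))
```

The paper's central question is whether rankings produced by scores like these actually track human opinion of answer quality; its contribution is a human evaluation method designed specifically for MRC.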