Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings

Authors: Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, Maximilian Sailer
Language: English
Publication year: 2023
Subject:
Source: Frontiers in Education, Vol 8 (2023)
Document type: article
ISSN: 2504-284X
DOI: 10.3389/feduc.2023.1272229
Description: This study reports the Intraclass Correlation Coefficients (ICC) of feedback ratings produced by OpenAI's GPT-4, a large language model (LLM), across various iterations, time frames, and stylistic variations. The model was used to rate responses to tasks related to macroeconomics in higher education (HE), based on their content and style. Statistical analysis was performed to determine the absolute agreement and consistency of ratings across all iterations, as well as the correlation between content ratings and style ratings. The findings revealed high interrater reliability, with ICC scores ranging from 0.94 to 0.99 across different time periods, indicating that GPT-4 is capable of producing consistent ratings. The prompt used in this study is also presented and explained.
Database: Directory of Open Access Journals
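
The description above refers to an ICC-based analysis of absolute agreement and consistency across repeated GPT-4 rating runs. The snippet below is a minimal, illustrative sketch (not taken from the paper) of how such coefficients can be computed in Python with the pingouin library; the column names and example values are hypothetical, and the paper's actual data layout and ICC variants may differ.

```python
# Minimal sketch: ICC for repeated GPT-4 ratings (hypothetical data).
import pandas as pd
import pingouin as pg

# Long-format data: each student response rated in several GPT-4 runs.
df = pd.DataFrame({
    "response_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "iteration":   ["run1", "run2", "run3"] * 3,
    "rating":      [4.0, 4.5, 4.0, 2.0, 2.5, 2.0, 5.0, 5.0, 4.5],
})

# Compute the full set of intraclass correlation coefficients.
icc = pg.intraclass_corr(
    data=df, targets="response_id", raters="iteration", ratings="rating"
)

# ICC2/ICC2k reflect absolute agreement; ICC3/ICC3k reflect consistency.
print(icc[["Type", "ICC", "CI95%"]])
```

In this layout the repeated GPT-4 runs play the role of "raters" and the rated responses are the "targets", so the two-way ICC forms distinguish whether runs agree on exact scores (absolute agreement) or merely rank responses the same way (consistency).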