Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings.

Authors:	Hackl, Veronika; Müller, Alexandra Elena; Granitzer, Michael; Sailer, Maximilian
Subject:	LANGUAGE models; INTRACLASS correlation; INTER-observer reliability
Source:	Frontiers in Education; 2023, p1-8, 8p
Abstract:	This study reports the Intraclass Correlation Coefficients (ICCs) of feedback ratings produced by OpenAI's GPT-4, a large language model (LLM), across various iterations, time frames, and stylistic variations. The model was used to rate responses to tasks related to macroeconomics in higher education (HE), based on their content and style. Statistical analysis was performed to determine the absolute agreement and consistency of ratings in all iterations, and the correlation between the ratings in terms of content and style. The findings revealed high interrater reliability, with ICC scores ranging from 0.94 to 0.99 for different time periods, indicating that GPT-4 is capable of producing consistent ratings. The prompt used in this study is also presented and explained. [ABSTRACT FROM AUTHOR]
Database:	Complementary Index
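
Note on the two ICC variants named in the abstract: the distinction between "absolute agreement" and "consistency" can be illustrated with a minimal Python sketch. It assumes single-measure ICCs computed from two-way ANOVA mean squares; the record does not state which ICC form the authors used. The function name `icc_single` and the toy ratings are hypothetical and are not taken from the study.

```python
def icc_single(ratings):
    """Return (absolute_agreement, consistency) single-measure ICCs.

    `ratings` holds one row per rated text and one column per rating run.
    Assumption: two-way ANOVA mean squares; the study's exact ICC form
    is not given in this record.
    """
    n = len(ratings)        # number of rated texts
    k = len(ratings[0])     # number of rating runs
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares for texts (rows), runs (columns), and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Consistency ignores systematic offsets between runs.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # Absolute agreement penalizes those offsets via the column term.
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return agreement, consistency


# Hypothetical toy data: the second run scores every text one point higher.
toy = [[1, 2], [2, 3], [3, 4], [4, 5]]
agreement, consistency = icc_single(toy)
print(round(agreement, 3), round(consistency, 3))  # prints: 0.769 1.0
```

In the toy data the two runs rank the texts identically, so consistency is perfect, while the constant one-point offset lowers absolute agreement.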