Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings
Author: | Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, Maximilian Sailer |
---|---
Language: | English |
Year of publication: | 2023 |
Subject: | |
Zdroj: | Frontiers in Education, Vol 8 (2023) |
Document type: | article |
ISSN: | 2504-284X 0569-8103 |
DOI: | 10.3389/feduc.2023.1272229 |
Description: | This study reports the Intraclass Correlation Coefficients (ICCs) of feedback ratings produced by OpenAI's GPT-4, a large language model (LLM), across various iterations, time frames, and stylistic variations. The model was used to rate responses to macroeconomics tasks in higher education (HE) on both content and style. Statistical analysis was performed to determine the absolute agreement and consistency of ratings across all iterations, as well as the correlation between content and style ratings. The findings revealed high interrater reliability, with ICC scores ranging from 0.94 to 0.99 across different time periods, indicating that GPT-4 is capable of producing consistent ratings. The prompt used in this study is also presented and explained. |
Database: | Directory of Open Access Journals |
External link: |
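The description above summarizes consistency (ICC for consistency) and absolute agreement (ICC for agreement) between repeated GPT-4 rating runs. The record does not include the authors' analysis code, but the standard two-way Shrout–Fleiss formulas behind these two ICC variants can be sketched as follows, treating each GPT-4 iteration as a "rater" over the same set of responses (an illustrative sketch only; function and variable names are hypothetical, not from the paper):

```python
def icc_consistency_and_agreement(ratings):
    """Two-way ICC estimates for an n-subjects x k-raters matrix.

    Returns (ICC(3,1) consistency, ICC(2,1) absolute agreement),
    following the standard Shrout-Fleiss mean-square decomposition.
    """
    n = len(ratings)        # number of rated responses (subjects)
    k = len(ratings[0])     # number of rating runs (raters)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way sum-of-squares decomposition: subjects, raters, residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)             # between-subjects mean square
    ms_c = ss_cols / (k - 1)             # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))  # residual mean square

    # Consistency ignores systematic rater offsets; agreement penalizes them.
    icc_consistency = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    icc_agreement = (ms_r - ms_e) / (
        ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n
    )
    return icc_consistency, icc_agreement
```

The difference between the two variants shows up when one run is systematically shifted: if a second run gives exactly the same rankings but adds a constant offset to every score, consistency is perfect (1.0) while absolute agreement drops below 1.0.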