ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.

Authors: Danehy T; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Hecht J; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Kentis S; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Schechter CB; Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States., Jariwala SP; Division of Allergy/Immunology, Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States.
Language: English
Source: Applied Clinical Informatics [Appl Clin Inform] 2024 Oct; Vol. 15 (5), pp. 1049-1055. Date of Electronic Publication: 2024 Aug 29.
DOI: 10.1055/a-2405-0138
Abstract: Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge-based questions. Additional objectives are to compare the overall accuracy of GPT-3.5 with that of GPT-4 and to assess the variability of responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test preparation service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We posed each question to GPT-3.5 and GPT-4 over 30 trials and recorded the outputs. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
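As a minimal sketch of the variability measure (not the authors' analysis code), Shannon entropy can be computed over the distribution of answer choices a model returns across the 30 trials of a given question; the function name, the trial data, and the use of base-2 logarithms below are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits, assuming base-2 logs) of the answer choices
    given across repeated trials. An entropy of 0 means the model returned
    the same answer in every trial; higher values mean more variable answers."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example: answer letters returned by a model over 30 trials of one question.
trials = ["C"] * 27 + ["A"] * 2 + ["D"]
print(round(shannon_entropy(trials), 2))  # 0.56 bits: mostly consistent, with occasional deviation
```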
Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 performed 18 percentage points (p < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (p = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics questions and 33 percentage points (p < 0.001) on medical knowledge questions. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower variability in responses.
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions for medical education.
Competing Interests: None declared.
(Thieme. All rights reserved.)
Database: MEDLINE