ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.
| Author: | Danehy T; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Hecht J; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Kentis S; Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States., Schechter CB; Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States., Jariwala SP; Division of Allergy/Immunology, Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States. |
| --- | --- |
| Language: | English |
| Source: | Applied Clinical Informatics [Appl Clin Inform] 2024 Oct; Vol. 15 (5), pp. 1049-1055. Date of Electronic Publication: 2024 Aug 29. |
| DOI: | 10.1055/a-2405-0138 |
| Abstract: | Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge questions. Secondary objectives are to compare the overall accuracy of GPT-3.5 and GPT-4 and to assess the variability of each version's responses. Methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We posed these questions to GPT-3.5 and GPT-4 across 30 trials and recorded the outputs. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variability. Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower on medical ethics questions than on medical knowledge questions (p < 0.05), and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics questions and by 33 percentage points (p < 0.001) on medical knowledge questions. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower response variability (see the sketch following this record). Conclusion: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions for medical education. Competing Interests: None declared. (Thieme. All rights reserved.) |
| Database: | MEDLINE |
| External link: | |
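
The response-variability measure reported in the abstract is Shannon entropy, computed over the answer choices a model gives to a question across the 30 trials. The following is a minimal Python sketch of that calculation, assuming hypothetical answer-choice distributions; the study's actual per-question data and code are not part of this record.

```python
# Minimal illustrative sketch (not the study's code): Shannon entropy of the
# answer choices a model gives to one question across repeated trials.
from collections import Counter
from math import log2

def shannon_entropy(answers: list[str]) -> float:
    """Entropy (in bits) of the answer-choice distribution for one question.

    0 bits means the same answer was given in every trial;
    higher values mean more variability across trials.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Hypothetical example: one question asked over 30 trials.
consistent_runs = ["B"] * 30                        # same answer every trial
variable_runs = ["B"] * 18 + ["C"] * 9 + ["D"] * 3  # answers spread over choices

print(shannon_entropy(consistent_runs))  # 0.0
print(shannon_entropy(variable_runs))    # ~1.30 bits
```

Under this measure, the lower entropies reported for GPT-4 (0.21 and 0.11) correspond to more consistent answer choices across trials than those of GPT-3.5 (0.59 and 0.55).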