Description: |
  Background: Interest in generative large language models (LLMs) has grown rapidly. While ChatGPT (GPT-3.5), a general-purpose LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT and its successor, GPT-4, on specialized examinations and the factors affecting their accuracy remain unclear.
  Objective: To assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.
  Methods: The Self-Assessment Neurosurgery Exams (SANS) American Board of Neurological Surgery (ABNS) Self-Assessment Exam 1 was used to evaluate ChatGPT and GPT-4. Questions were in single-best-answer, multiple-choice format. Chi-squared, Fisher's exact, and univariable logistic regression tests were employed to assess performance differences in relation to question characteristics.
  Results: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% confidence interval [CI]: 69.3-77.2%) and 83.4% (95% CI: 79.8-86.5%), respectively, relative to the user average of 73.7% (95% CI: 69.6-77.5%). Question bank users and both LLMs exceeded last year's passing threshold of 69%. While scores between ChatGPT and question bank users were equivalent (P=0.963), GPT-4 outperformed both. Certain question characteristics (including one at P=0.009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P>0.005). Multimodal input was not available at the time of this study, so on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based upon contextual clues alone.
  Conclusion: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
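  As an illustration of the analysis named in Methods, the following is a minimal Python sketch of how per-question accuracy could be compared against a binary question characteristic using chi-squared, Fisher's exact, and univariable logistic regression tests. The data, the has_image variable, and all values below are hypothetical placeholders, not the study's records.

    import numpy as np
    from scipy.stats import chi2_contingency, fisher_exact
    import statsmodels.api as sm

    # Hypothetical per-question outcomes (1 = correct, 0 = incorrect) for 20 questions.
    correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
    # Hypothetical binary question characteristic (e.g., question contains an image).
    has_image = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1])

    # 2x2 contingency table: characteristic (rows) vs. correctness (columns).
    table = np.array([
        [np.sum((has_image == 1) & (correct == 1)), np.sum((has_image == 1) & (correct == 0))],
        [np.sum((has_image == 0) & (correct == 1)), np.sum((has_image == 0) & (correct == 0))],
    ])

    chi2_stat, p_chi2, dof, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)  # exact test, useful when expected cell counts are small

    # Univariable logistic regression: correctness ~ characteristic.
    X = sm.add_constant(has_image.astype(float))
    fit = sm.Logit(correct, X).fit(disp=0)

    print(f"chi-squared P={p_chi2:.3f}, Fisher exact P={p_fisher:.3f}, "
          f"logistic regression P={fit.pvalues[1]:.3f}")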