Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.

Author: Dhanvijay AKD; Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND., Pinjar MJ; Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND., Dhokane N; Physiology, Government Medical College, Sindhudurg, Oros, IND., Sorte SR; Physiology, All India Institute of Medical Sciences, Nagpur, Nagpur, IND., Kumari A; Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND., Mondal H; Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND.
Language: English
Source: Cureus [Cureus] 2023 Aug 04; Vol. 15 (8), pp. e42972. Date of Electronic Publication: 2023 Aug 04 (Print Publication: 2023).
DOI: 10.7759/cureus.42972
Abstract: Background: Large language models (LLMs) have emerged as powerful tools capable of processing and generating human-like text. LLMs such as ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States), Google Bard (Alphabet Inc., CA, US), and Microsoft Bing (Microsoft Corporation, WA, US) have been applied across various domains, demonstrating their potential to assist in solving complex tasks and improving information accessibility. However, their application to solving case vignettes in physiology has not been explored. This study aimed to assess the performance of three LLMs, namely, ChatGPT (3.5; free research version), Google Bard (Experiment), and Microsoft Bing (Precise), in answering case vignettes in physiology.

Methods: This cross-sectional study was conducted in July 2023. A total of 77 case vignettes in physiology were prepared by two physiologists and validated by two other content experts. These cases were presented to each LLM, and their responses were collected. Two physiologists independently rated the answers provided by the LLMs for accuracy on a scale from 0 to 4 according to the structure of the observed learning outcome (SOLO) taxonomy (pre-structural = 0, uni-structural = 1, multi-structural = 2, relational = 3, extended abstract = 4). Scores among the LLMs were compared by Friedman's test, and inter-observer agreement was checked by the intraclass correlation coefficient (ICC).

Results: Across the 77 cases, the overall scores for ChatGPT, Bing, and Bard were 3.19±0.3, 2.15±0.6, and 2.91±0.5, respectively (p<0.0001). Hence, ChatGPT 3.5 (free version) obtained the highest score, Bing (Precise) the lowest, and Bard (Experiment) fell in between. The average ICC values for ChatGPT, Bing, and Bard were 0.858 (95% CI: 0.777 to 0.91, p<0.0001), 0.975 (95% CI: 0.961 to 0.984, p<0.0001), and 0.964 (95% CI: 0.944 to 0.977, p<0.0001), respectively.

Conclusion: ChatGPT outperformed Bard and Bing in answering case vignettes in physiology. Students and teachers may therefore weigh these differences when choosing an LLM for case-based learning in physiology. Further exploration of their capabilities is needed before adopting them in medical education and clinical decision support.
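The abstract specifies the analysis pipeline (Friedman's test across the three LLMs' per-case scores, ICC for inter-observer agreement) without naming the software used. The sketch below is a minimal, hypothetical reconstruction in Python: the per-case SOLO ratings are randomly generated stand-ins for the study's data, and scipy.stats.friedmanchisquare and pingouin.intraclass_corr are assumed library choices, not those of the authors.

```python
# Hypothetical reconstruction of the abstract's statistics; the real
# per-case ratings are not published in this record.
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import pingouin as pg

rng = np.random.default_rng(0)
n_cases = 77  # number of case vignettes in the study

# Invented SOLO ratings (0-4) for each vignette, one array per LLM.
chatgpt = rng.integers(2, 5, n_cases)
bing = rng.integers(1, 4, n_cases)
bard = rng.integers(2, 4, n_cases)

# Friedman's test: non-parametric comparison of three related samples
# (the same 77 cases scored for each LLM).
stat, p = friedmanchisquare(chatgpt, bing, bard)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Inter-observer agreement for one LLM, assuming two raters per case;
# the second rater's scores are simulated as small deviations here.
rater_a = chatgpt
rater_b = np.clip(chatgpt + rng.integers(-1, 2, n_cases), 0, 4)
df = pd.DataFrame({
    "case": np.tile(np.arange(n_cases), 2),
    "rater": np.repeat(["A", "B"], n_cases),
    "score": np.concatenate([rater_a, rater_b]),
})
icc = pg.intraclass_corr(data=df, targets="case", raters="rater",
                         ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

pingouin reports several ICC forms (ICC1 through ICC3k); the abstract's "average ICC" most closely corresponds to the average-measures rows, though the exact model used is not stated in the record.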
Competing Interests: The authors have declared that no competing interests exist.
(Copyright © 2023, Dhanvijay et al.)
Database: MEDLINE