Can large language models provide accurate and quality information to parents regarding chronic kidney diseases?

Authors: Naz R; Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey. Akacı O; Clinic of Pediatric Nephrology, Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey. Erdoğan H; Clinic of Pediatric Nephrology, Bursa City Hospital, Bursa, Turkey. Açıkgöz A; Department of Pediatric Nursing, Faculty of Health Sciences, Eskişehir Osmangazi University, Eskişehir, Turkey.
Language: English
Source: Journal of Evaluation in Clinical Practice [J Eval Clin Pract] 2024 Jul 03. Date of Electronic Publication: 2024 Jul 03.
DOI: 10.1111/jep.14084
Abstract: Rationale: Artificial intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. The use of these language models in various medical contexts is currently being studied; however, their performance and content quality have not been evaluated in specific medical fields.
Aims and Objectives: This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini and Copilot in providing information to parents about chronic kidney diseases (CKD), and to compare the accuracy and quality of their information with that of a reference source.
Methods: In this study, 40 frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated with reference to the Kidney Disease: Improving Global Outcomes guidelines. The accuracy of the responses generated by LLMs was assessed using F1, precision and recall scores. The quality of the responses was evaluated using a five-point global quality score (GQS).
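For reference, the standard definitions of these accuracy metrics, supplied here for clarity rather than quoted from the article (TP = true positives, FP = false positives, FN = false negatives), are:

$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

That is, the F1 score is the harmonic mean of precision and recall, balancing the two in a single value.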
Results: ChatGPT and Gemini achieved high F1 scores of 0.89 and 1.00, respectively, in the diagnosis and lifestyle categories, and both generated accurate responses with high precision values in these categories. In terms of recall, all LLMs performed strongly in the diagnosis, treatment and lifestyle categories. Mean GQS values for the generated responses were 3.46 ± 0.55, 1.93 ± 0.63 and 2.02 ± 0.69 for Gemini, ChatGPT 3.5 and Copilot, respectively. Gemini outperformed ChatGPT and Copilot in all categories.
Conclusion: Although LLMs can provide parents with highly accurate information about CKD, their performance remains limited compared with that of a reference source. These limitations can lead to misinformation and potential misinterpretation. Therefore, patients and parents should exercise caution when using these models.
(© 2024 John Wiley & Sons Ltd.)
Database: MEDLINE