MedConceptsQA: Open source medical concepts QA benchmark.
Author: | Shoham OB; Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel. Electronic address: benshoho@post.bgu.ac.il., Rappoport N; Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel. Electronic address: nadavrap@bgu.ac.il. |
Language: | English |
Source: | Computers in biology and medicine [Comput Biol Med] 2024 Nov; Vol. 182, pp. 109089. Date of Electronic Publication: 2024 Sep 13. |
DOI: | 10.1016/j.compbiomed.2024.109089 |
Abstract: | Background: Clinical data often include both standardized medical codes and natural language text. This highlights the need for Clinical Large Language Models to understand these codes and their differences. We introduce a benchmark for evaluating the understanding of medical codes by various Large Language Models. Methods: We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluate various Large Language Models on the benchmark. Results: Our findings show that most of the pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of 9-11% (9% for few-shot learning and 11% for zero-shot learning) compared to Llama3-OpenBioLLM-70B, the clinical Large Language Model that achieved the best results. Conclusion: Our benchmark serves as a valuable resource for evaluating the abilities of Large Language Models to interpret medical codes and distinguish between medical concepts. We demonstrate that most of the current state-of-the-art clinical Large Language Models perform at the level of random guessing, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models, despite their pre-training not being primarily focused on the medical domain. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA (see the loading sketch after this record). Competing Interests: Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. (Copyright © 2024 The Author(s). Published by Elsevier Ltd. All rights reserved.) |
Database: | MEDLINE |
External link: |
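The abstract describes zero-shot and few-shot evaluation of Large Language Models on multiple-choice questions about medical codes. The sketch below shows, under stated assumptions, how the published dataset might be loaded with the Hugging Face `datasets` library and formatted into a zero-shot prompt. Only the dataset URL comes from the record; the configuration name, split, and column names used here are hypothetical and should be checked against the dataset card.

```python
# Minimal sketch: load MedConceptsQA from the Hugging Face Hub and build a
# zero-shot multiple-choice prompt. The config name "icd10cm_easy", the split
# "test", and the column names below are ASSUMPTIONS, not confirmed by the
# record; see https://huggingface.co/datasets/ofir408/MedConceptsQA.
from datasets import load_dataset

# Assumed configuration and split; the dataset card lists the actual ones.
dataset = load_dataset("ofir408/MedConceptsQA", name="icd10cm_easy", split="test")


def build_prompt(example: dict) -> str:
    """Format one multiple-choice question as a zero-shot prompt (assumed fields)."""
    return (
        f"{example['question']}\n"
        f"A. {example['option_a']}\n"
        f"B. {example['option_b']}\n"
        f"C. {example['option_c']}\n"
        f"D. {example['option_d']}\n"
        "Answer with a single letter (A-D)."
    )


# Print a few formatted questions with their gold answers.
for example in dataset.select(range(3)):
    print(build_prompt(example))
    print("Gold answer:", example["answer_id"])
```

A few-shot variant would simply prepend a handful of already-answered examples from the same difficulty level before the target question; accuracy is then the fraction of questions for which the model's letter matches the gold answer.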