Biomedical knowledge graph-optimized prompt generation for large language models.

Authors: Soman K; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States., Rose PW; San Diego Supercomputer Center, University of California, San Diego, CA 92093, United States., Morris JH; Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94158, United States., Akbas RE; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States., Smith B; Institute for Systems Biology, Seattle, WA 98109, United States., Peetoom B; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States., Villouta-Reyes C; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States., Cerono G; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States., Shi Y; Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States., Rizk-Jackson A; Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States., Israni S; Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States., Nelson CA; Mate Bioservices, Inc. Swallowtail Ct., Brisbane, CA 94005, United States., Huang S; Institute for Systems Biology, Seattle, WA 98109, United States., Baranzini SE; Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States.
Language: English
Source: Bioinformatics (Oxford, England) [Bioinformatics] 2024 Sep 02; Vol. 40 (9).
DOI: 10.1093/bioinformatics/btae560
Abstract: Motivation: Large language models (LLMs) are being adopted at an unprecedented rate, yet they still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require additional domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that leverages a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate meaningful biomedical text rooted in established knowledge.
Results: Compared to the existing RAG technique for knowledge graphs, the proposed method uses a minimal graph schema for context extraction and embedding methods for context pruning. This optimization of context extraction reduces token consumption by more than 50% without compromising accuracy, enabling a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters on domain-specific questions. KG-RAG also enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit knowledge of the KG with the implicit knowledge of the LLM in a token-optimized fashion, enhancing the adaptability of general-purpose LLMs to domain-specific questions in a cost-effective way.
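The embedding-based context pruning mentioned in the Results can be illustrated with a minimal sketch. This is not the KG-RAG implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, the function names, threshold, and item cap are hypothetical, and the knowledge-graph statements are placeholders rather than SPOKE content. The idea shown is only that statements weakly related to the question are dropped before the prompt is built, which shrinks the token count.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (assumed stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_context(question, statements, min_similarity=0.2, max_items=2):
    """Keep the statements most similar to the question (hypothetical threshold and cap)."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(s)), s) for s in statements]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:max_items]]


# Placeholder statements, not real SPOKE data.
statements = [
    "Disease X associates with Gene A",
    "Disease X presents Symptom B",
    "Compound C treats Disease Y",
    "Gene A is expressed in Tissue D",
]
question = "Which gene associates with Disease X?"

pruned = prune_context(question, statements)
for line in pruned:
    print(line)
```

In this toy run, only the two statements that share the most vocabulary with the question survive pruning; a real embedding model would score semantic rather than lexical similarity.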
Availability and Implementation: The SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html or through its REST API (https://spoke.rbvi.ucsf.edu/swagger/). The KG-RAG code is available at https://github.com/BaranziniLab/KG_RAG. The biomedical benchmark datasets used in this study are available to the research community in the same GitHub repository.
(© The Author(s) 2024. Published by Oxford University Press.)
Database: MEDLINE