Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text.

Authors: Rehana H; Department of Computer Science, School of Electrical Engineering & Computer Science, University of North Dakota, Grand Forks, ND 58202, United States.; Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States., Çam NB; Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey., Basmaci M; Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey., Zheng J; Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109, United States., Jemiyo C; Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States., He Y; Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109, United States.; Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States., Özgür A; Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey., Hur J; Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States.
Language: English
Source: Bioinformatics advances [Bioinform Adv] 2024 Sep 11; Vol. 4 (1), pp. vbae133. Date of Electronic Publication: 2024 Sep 11 (Print Publication: 2024).
DOI: 10.1093/bioadv/vbae133
Abstract: Motivation: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As biomedical literature continues to grow rapidly, there is an increasing need for automated and accurate extraction of these interactions to facilitate scientific discovery. Pretrained language models, such as generative pretrained transformers and bidirectional encoder representations from transformers, have shown promising results in natural language processing tasks.
Results: We evaluated the performance of PPI identification using multiple transformer-based models across three manually curated gold-standard corpora: Learning Language in Logic with 164 interactions in 77 sentences, Human Protein Reference Database with 163 interactions in 145 sentences, and Interaction Extraction Performance Assessment with 335 interactions in 486 sentences. Models based on bidirectional encoder representations achieved the best overall performance, with BioBERT achieving the highest recall of 91.95% and F1 score of 86.84% on the Learning Language in Logic dataset. Despite not being explicitly trained for biomedical texts, GPT-4 showed commendable performance, comparable to the bidirectional encoder models. Specifically, GPT-4 achieved the highest precision of 88.37%, a recall of 85.14%, and an F1 score of 86.49% on the same dataset. These results suggest that GPT-4 can effectively detect protein interactions from text, offering valuable applications in mining biomedical literature.
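The precision, recall, and F1 figures above follow the usual definitions for a binary "interaction / no interaction" decision. The sketch below shows how the three scores relate, assuming standard definitions over true-positive, false-positive, and false-negative counts. The function name and the counts are hypothetical and are not taken from the paper or its repository; the abstract does not state how scores were aggregated across the data.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1].

    tp: predicted interactions that are real interactions
    fp: predicted interactions that are not real interactions
    fn: real interactions the model missed
    """
    # Guard against empty denominators (no predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not from the paper).
p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={p:.2%} recall={r:.2%} F1={f1:.2%}")
```

Note that applying the harmonic mean directly to the reported GPT-4 precision (88.37%) and recall (85.14%) gives roughly 86.7%, slightly above the reported 86.49%. A gap of this kind can arise when scores are averaged over several evaluation splits, but that is an assumption here, since the abstract does not describe the protocol.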
Availability and Implementation: The source code and datasets used in this study are available at https://github.com/hurlab/PPI-GPT-BERT.
Competing Interests: None declared.
(© The Author(s) 2024. Published by Oxford University Press.)
Database: MEDLINE