Protein embeddings improve phage-host interaction prediction.

Authors: Gonzales MEM (Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines; Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines), Ureta JC (Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines; Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines), Shrestha AMS (Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines; Systems and Computational Biology Research Unit, Center for Natural Sciences and Environmental Research, De La Salle University, Manila, Philippines; Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines)
Language: English
Source: PloS one [PLoS One] 2023 Jul 24; Vol. 18 (7), pp. e0289030. Date of Electronic Publication: 2023 Jul 24 (Print Publication: 2023).
DOI: 10.1371/journal.pone.0289030
Abstract: With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
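Illustrative sketch (not from the paper): the abstract frames the task as multiclass classification over protein embeddings, with predictions evaluated at different confidence thresholds. The Python below shows what that framing might look like in miniature. Everything in it is an assumption made for illustration: the two-dimensional toy vectors stand in for real ProtT5 embeddings, mean pooling is one common way to obtain a fixed-length vector from per-residue embeddings, the nearest-centroid classifier with softmax scores is a stand-in and not the authors' model, and the function names and genus labels are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a set of equal-length vectors into one fixed-length vector."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def fit_centroids(embeddings: Sequence[Vector], genera: Sequence[str]) -> Dict[str, Vector]:
    """Compute one centroid per host genus (stand-in classifier, an assumption)."""
    grouped: Dict[str, List[Vector]] = {}
    for vec, genus in zip(embeddings, genera):
        grouped.setdefault(genus, []).append(vec)
    return {genus: mean_pool(vecs) for genus, vecs in grouped.items()}


def class_probabilities(vec: Vector, centroids: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax over negative Euclidean distances to each genus centroid."""
    scores = {genus: -math.dist(vec, c) for genus, c in centroids.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {genus: math.exp(s - top) for genus, s in scores.items()}
    total = sum(exps.values())
    return {genus: e / total for genus, e in exps.items()}


def predict_genus(vec: Vector, centroids: Dict[str, Vector], threshold: float) -> Optional[str]:
    """Return the most probable genus, or None if confidence is below the threshold."""
    probs = class_probabilities(vec, centroids)
    genus, confidence = max(probs.items(), key=lambda item: item[1])
    return genus if confidence >= threshold else None


# Toy training data: 2-D vectors standing in for receptor-binding protein embeddings.
train_embeddings = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]
train_genera = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
centroids = fit_centroids(train_embeddings, train_genera)

confident = predict_genus([1.0, 0.1], centroids, threshold=0.6)
too_strict = predict_genus([1.0, 0.1], centroids, threshold=0.9)
ambiguous = predict_genus([0.5, 0.5], centroids, threshold=0.6)
```

Raising the threshold trades coverage for precision: a query close to one centroid is labeled at a lenient threshold but left unlabeled at a strict one, and a query equidistant from both centroids is left unlabeled in either case.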
Competing Interests: The authors have declared that no competing interests exist.
(Copyright: © 2023 Gonzales et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
Database: MEDLINE