Protein embedding based alignment

Authors: Benjamin Giovanni Iovino, Yuzhen Ye
Language: English
Year of publication: 2024
Subject:
Source: BMC Bioinformatics, Vol 25, Iss 1, Pp 1-16 (2024)
Document type: article
ISSN: 1471-2105
DOI: 10.1186/s12859-024-05699-5
Description: Abstract. Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20–35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score of two amino acids is based on the similarity of their embeddings from a protein language model. Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, making it possible to assess how well PEbA aligns under different circumstances. Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvement in alignment quality for pairs of sequences with different levels of similarity (over four times as well for pairs of sequences with
Database: Directory of Open Access Journals
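
The idea summarized in the abstract, Smith-Waterman-style dynamic programming in which the match score comes from embedding similarity rather than a substitution matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. The record does not specify the similarity measure, the gap model, or the language model, so the sketch assumes cosine similarity, a linear gap penalty (`gap`), and an optional `scale` factor. The helper `toy_embed` is a hypothetical stand-in that returns one-hot vectors where a real pipeline would use per-residue embeddings from a protein language model.

```python
import numpy as np

def toy_embed(seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Hypothetical stand-in for a language model: one one-hot row per residue."""
    return np.eye(len(alphabet))[[alphabet.index(c) for c in seq]]

def cosine_matrix(emb_a, emb_b):
    """Cosine similarity between every row of emb_a (n x d) and emb_b (m x d)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T

def embedding_local_align(seq_a, seq_b, emb_a, emb_b, gap=-1.0, scale=1.0):
    """Local alignment scored by embedding similarity (assumed linear gap penalty)."""
    sim = scale * cosine_matrix(emb_a, emb_b)
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))
    # Fill the score matrix; the floor at 0 makes the alignment local.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim[i - 1, j - 1],  # match/mismatch
                          H[i - 1, j] + gap,                    # gap in seq_b
                          H[i, j - 1] + gap)                    # gap in seq_a
    # Trace back from the best-scoring cell until the score drops to 0.
    i, j = np.unravel_index(np.argmax(H), H.shape)
    best = float(H[i, j])
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i, j] > 0:
        if np.isclose(H[i, j], H[i - 1, j - 1] + sim[i - 1, j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif np.isclose(H[i, j], H[i - 1, j] + gap):
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy usage: with one-hot vectors, identical residues score 1 and all others 0.
a, b = "ACDE", "CD"
print(embedding_local_align(a, b, toy_embed(a), toy_embed(b)))
```

With one-hot vectors the sketch reduces to identity scoring, so the toy call aligns "CD" with "CD" at a score of 2.0. Replacing `toy_embed` with real per-residue embeddings changes only the `sim` matrix, not the dynamic programming itself.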