Protein Sequence Classification Using Bidirectional Encoder Representations from Transformers (BERT) Approach

Authors: Balamurugan, R., Mohite, Saurabh, Raja, S. P.
Source: SN Computer Science; September 2023, Vol. 4 Issue: 5
Abstract: Proteins play a vital role by carrying out a number of activities within an organism to sustain its life. The field of Natural Language Processing has successfully adapted deep learning to gain better insight into the semantic nature of languages. In this paper, we propose semantic approaches based on deep learning to work with protein amino acid sequences, and we compare the performance of these approaches with that of traditional classifiers in predicting the sequences' respective families. The Bidirectional Encoder Representations from Transformers (BERT) approach was tested on 103 protein families from the UniProt consortium database. The results show an average prediction accuracy of 99.02%, a testing accuracy of 97.70%, and a validation accuracy of 97.69%; Normalized Mutual Information (NMI) scores of 98.45 on the overall data, 96.99 on the test data, and 96.93 on the validation data; weighted average F1 scores of 99.02 on the overall data, 97.72 on the test data, and 97.70 on the validation data; and macro average F1 scores of 99.00 on the overall data, 98.00 on the test data, and 98.00 on the validation data. These results indicate that our proposed approach outperforms the existing approaches.
Database: Supplemental Index