Implementation of machine learning models to determine the appropriate model for protein function prediction

Authors: Yekaterina Golenko, Aisulu Ismailova, Anargul Shaushenova, Zhazira Mutalova, Damir Dossalyanov, Aliya Ainagulova, Akgul Naizagarayeva
Publication year: 2022
Subject:
Source: Eastern-European Journal of Enterprise Technologies. 5:42-49
ISSN: 1729-4061
1729-3774
DOI: 10.15587/1729-4061.2022.263270
Description: Predicting protein function is a crucial part of genome annotation and can help solve a wide range of biological problems. Many methods are available for predicting protein function. However, apart from sequence, most features are difficult to obtain or unavailable for many proteins, which limits the scope of these methods. In addition, prediction methods based on sequence features alone often perform worse than methods that combine multiple features, and protein feature prediction can be time-consuming. Recent advances in the field are associated with machine learning, which has shown great progress on the problem of predicting protein function. Even so, most protein sequences today still have the status "uncharacterized" or "putative". Assessing how accurately machine learning approaches identify protein function is therefore an urgent task. In this study, the performance of two popular function prediction algorithms (ProtCNN and BiLSTM) was assessed from two perspectives, and the procedures for building these models were described. On Pfam families, ProtCNN achieves an accuracy of 0.988 (98.8%) and the bidirectional LSTM an accuracy of 0.9506 (95.06%). The large Pfam training dataset made it possible to increase classification accuracy, since prediction quality improves with the amount of training data. The study demonstrated that machine learning algorithms can serve as an effective tool for building protein function prediction models; in particular, a CNN can be adapted as an accurate tool for annotating protein functions when large datasets are available.
Database: OpenAIRE