Implementation of machine learning models to determine the appropriate model for protein function prediction
Author: | Yekaterina Golenko, Aisulu Ismailova, Anargul Shaushenova, Zhazira Mutalova, Damir Dossalyanov, Aliya Ainagulova, Akgul Naizagarayeva |
---|---|
Year of publication: | 2022 |
Subject: | Applied Mathematics; Mechanical Engineering; Energy Engineering and Power Technology; bidirectional long short-term memory (BiLSTM); neural networks; ProtCNN; Industrial and Manufacturing Engineering; Computer Science Applications; classification; Control and Systems Engineering; Management of Technology and Innovation; Environmental Chemistry; Electrical and Electronic Engineering; protein function prediction; Food Science |
Source: | Eastern-European Journal of Enterprise Technologies. 5:42-49 |
ISSN: | 1729-4061 1729-3774 |
DOI: | 10.15587/1729-4061.2022.263270 |
Description: | Predicting the function of proteins is a crucial part of genome annotation and can help solve a wide range of biological problems. Many methods are available for predicting protein function. However, apart from the sequence itself, most features are difficult to obtain or are unavailable for many proteins, which limits the scope of these methods. In addition, the performance of sequence-based feature prediction methods is often lower than that of methods that use multiple features, and protein feature prediction can be time-consuming. Recent advances in this field are associated with the development of machine learning, which has made substantial progress on the problem of predicting protein functions. Even so, most protein sequences today still have the status "uncharacterized" or "putative". Assessing how accurately protein functions are identified is therefore an urgent task for the machine learning approaches used to predict them. In this study, the performance of two popular function prediction algorithms (ProtCNN and BiLSTM) was assessed from two perspectives, and the procedures for building these models were described. On the Pfam families studied, ProtCNN achieved an accuracy of 0.988 (98.8%) and the bidirectional LSTM an accuracy of 0.9506 (95.06%). The large size of the Pfam training set contributed to the high classification accuracy: prediction quality improves as the amount of training data grows. The study demonstrated that machine learning algorithms can be an effective tool for building protein function prediction models; in particular, a CNN can be adapted as an accurate tool for annotating protein functions when large datasets are available. |
Database: | OpenAIRE |
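The abstract describes classifying protein sequences into Pfam families and reporting accuracy as a fraction (0.988 corresponds to 98.8%). The following is a minimal sketch of two steps such a sequence-based pipeline might include: one-hot encoding a sequence for a neural network input, and computing accuracy as a fraction of correct predictions. The function names, the padding scheme, and the family labels are hypothetical and are not taken from the paper, whose actual preprocessing is not given in this record.

```python
# Hypothetical helpers for illustration only; not the authors' code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated. Padding positions and
    non-standard residues (e.g. 'X') become all-zero rows (an assumption).
    """
    rows = []
    for position in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if position < len(sequence):
            idx = INDEX.get(sequence[position].upper())
            if idx is not None:
                row[idx] = 1
        rows.append(row)
    return rows


def accuracy(true_labels, predicted_labels):
    """Fraction of correct predictions, a value in [0, 1]."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    encoded = one_hot_encode("ACX", max_len=4)
    print(len(encoded), len(encoded[0]))

    # Made-up family labels, used only to show the fraction/percent relation.
    truth = ["FAM_A", "FAM_B", "FAM_C", "FAM_D"]
    preds = ["FAM_A", "FAM_B", "FAM_C", "FAM_X"]
    acc = accuracy(truth, preds)
    print(f"accuracy = {acc} ({acc * 100:.1f}%)")
```

Multiplying the returned fraction by 100 gives the percentage form, which is why a value such as 0.988 should be read as 98.8% rather than 0.988%.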