Sequence and structure based deep learning models represent different aspects of protein biochemistry

Authors: Anastasiya V. Kulikova, Daniel J. Diaz, Tianlong Chen, T. Jeffrey Cole, Andrew D. Ellington, Claus O. Wilke
Year of publication: 2023
Subject:
Source: bioRxiv
Description: Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations at various sites in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D convolutional neural networks (CNNs). These two model types have very different architectures and are trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences, whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and one 3D CNN model and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between sequence-based and structure-based models. Overall, the 3D CNN model is better at predicting buried aliphatic and hydrophobic residues, whereas the LLMs are better at predicting solvent-exposed polar and charged amino acids. A combined model that takes the individual model predictions as input can leverage these individual model strengths and achieves significantly improved overall prediction accuracy.
Database: OpenAIRE
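
Note: The description mentions a combined model that takes the individual model predictions as input. The record does not say how that model is built, so the following is only a minimal sketch of the general idea, assuming each model outputs a probability for every amino acid at a site and that the outputs are merged by a fixed weighted average. The function names, the weights, and the toy probabilities are all hypothetical and are not taken from the paper, whose combined model may well be a trained model rather than a simple average.

```python
from typing import Dict, Sequence

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combine_predictions(
    per_model_probs: Sequence[Dict[str, float]],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Weighted average of per-site amino acid probabilities.

    Hypothetical stand-in for a combined model: each input dict maps an
    amino acid to the probability one model assigns to it at a single site.
    """
    if len(per_model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        aa: sum(w * p.get(aa, 0.0) for w, p in zip(weights, per_model_probs))
        / total_weight
        for aa in AMINO_ACIDS
    }


def predicted_residue(probs: Dict[str, float]) -> str:
    """Return the amino acid with the highest probability."""
    return max(AMINO_ACIDS, key=lambda aa: probs.get(aa, 0.0))


if __name__ == "__main__":
    # Toy, made-up probabilities for one buried site (not real model output).
    llm_a = {"K": 0.6, "L": 0.3, "A": 0.1}
    llm_b = {"K": 0.5, "L": 0.4, "A": 0.1}
    cnn = {"L": 0.9, "K": 0.05, "A": 0.05}
    combined = combine_predictions([llm_a, llm_b, cnn], weights=[1.0, 1.0, 1.0])
    print(predicted_residue(combined))
```

In this toy case the two sequence models favor a charged residue while the structure model strongly favors a hydrophobic one, and the average follows the more confident model. A learned combiner could instead weight each model differently depending on the site, which is closer in spirit to what the description reports.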