Uncovering Distant Protein Relationships with Deep Generative Models

Author: Draizen, Eli J
Year of publication: 2020
DOI: 10.5281/zenodo.6321395
Description: Recent advances in protein structure determination and prediction offer new opportunities to decipher relationships among proteins, a task that entails 3D structure comparison and classification. Historically, protein domain classification has been somewhat manual and heuristic. While CATH and related resources represent significant steps towards a more systematic and automatable approach, more scalable and objective classification methods will enable a fuller exploration of protein structure or 'fold' space. Comparative analyses of protein structure latent spaces may uncover distant relationships, and will potentially entail a large-scale restructuring of traditional classification schemes. We have developed 3D convolutional variational autoencoders to 'define' ideal geometries and biophysical properties of proteins at CATH’s homologous superfamily (SF) level. To quantitatively evaluate pairwise 'distances' between SFs, we built one model per SF and compared the evidence lower bound (ELBO) loss functions of the models when evaluated with different SF structure representatives. Identifying patterns in these distances with Stochastic Block Models provides a new view of protein interrelationships, one that extends beyond simple structural/geometric similarity, towards the realm of structure/function properties, and that is consistent with a recently proposed 'Urfold' model of protein relationships. For my first aim, I propose to develop a community resource to create and share protein properties: structural, biophysical and evolutionary. These properties can be used as feature sets in any machine learning model; besides reusability and efficiency, such a resource would also facilitate more reproducible workflows by ensuring analyses are performed with standardized data. In my second aim, I will design a sequence-independent, alignment-free, rotationally invariant similarity metric of proteins based on Deep Generative Models and 3D structures.
This framework will leverage similarities in latent spaces rather than in the 3D structures directly, and it will encode biophysical properties; this capability, in turn, will allow higher orders of similarity to be detected among proteins that are presumed to be only distantly related. Finally, my third aim will explore a new approach to detect clusters, or 'communities', of similar protein structures using Stochastic Block Models. Unlike traditional clustering, this method allows proteins to span multiple clusters, thereby accommodating the continuous nature of fold space.
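The description mentions comparing the ELBO losses of per-superfamily models when each is evaluated on representatives of other superfamilies. The record does not say how those losses are turned into pairwise 'distances', so the following is only a minimal sketch under stated assumptions: the function name `elbo_distance_matrix`, the baseline subtraction and the symmetrization are hypothetical choices for illustration, not the author's method. It assumes a precomputed square matrix whose entry (i, j) is the mean ELBO loss of the model trained on SF i when evaluated on representatives of SF j.

```python
import numpy as np


def elbo_distance_matrix(loss):
    """Turn a cross-evaluation loss matrix into a symmetric distance matrix.

    loss[i, j] is ASSUMED to be the mean ELBO loss of the model trained on
    superfamily i, evaluated on structure representatives of superfamily j.
    The baseline subtraction and averaging below are illustrative choices.
    """
    loss = np.asarray(loss, dtype=float)
    if loss.ndim != 2 or loss.shape[0] != loss.shape[1]:
        raise ValueError("loss must be a square matrix (one model per SF)")

    # Excess loss of model i on SF j, relative to model i on its own SF.
    excess = loss - np.diag(loss)[:, None]

    # Average the two directions (i on j, j on i) to obtain a symmetric score.
    dist = 0.5 * (excess + excess.T)
    np.fill_diagonal(dist, 0.0)
    return dist


# Toy example with two hypothetical superfamilies (values are made up).
toy_loss = [[1.0, 3.0],
            [5.0, 2.0]]
toy_dist = elbo_distance_matrix(toy_loss)
```

A symmetric matrix of this kind could then serve as edge weights for a downstream community-detection step; whether the original work symmetrizes the scores at all is not stated in this record.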
Database: OpenAIRE