Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering

Authors: Raquel C. de Melo-Minardi, Elisa Boari de Lima, Wagner Meira
Language: English
Year of publication: 2016
Subject:
Protein Structure Comparison
0301 basic medicine
Pathology and Laboratory Medicine
computer.software_genre
Biochemistry
Sequence Analysis, Protein
Medicine and Health Sciences
Macromolecular Structure Analysis
Cluster Analysis
Databases, Protein
Dehydration (Medicine)
lcsh:QH301-705.5
Protein subfamily
Ecology
Proteases
Spectral clustering
Enzymes
Identification (information)
Computational Theory and Mathematics
Modeling and Simulation
Data mining
Sequence Analysis
Algorithms
Adenylyl Cyclase
Research Article
Protein Structure
Protein family
Protein domain
Lyases
Sequence alignment
Computational biology
Biology
Research and Analysis Methods
03 medical and health sciences
Cellular and Molecular Neuroscience
Signs and Symptoms
Protein Domains
Similarity (network science)
Diagnostic Medicine
Genetics
Amino Acid Sequence
Molecular Biology Techniques
Sequencing Techniques
Molecular Biology
Ecology, Evolution, Behavior and Systematics
Computational Biology
Proteins
Biology and Life Sciences
Hierarchical clustering
030104 developmental biology
lcsh:Biology (General)
Enzymology
Serine Proteases
Sequence Alignment
Protein Kinases
computer
Source: PLoS Computational Biology, Vol 12, Iss 6, p e1005001 (2016)
ISSN: 1553-7358
Description: As ever more genomes are sequenced, the vast majority of proteins can only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods that determine protein functions quickly and reliably. We believe that dividing a protein family into subtypes that share specific functions uncommon to the whole family reduces the complexity of the function annotation problem. Hence, the purpose of this work is to detect isofunctional subfamilies within a family of unknown function while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as evidence of functional similarity. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Our fully automated technique obtained better clusters than ASMC for two families and equivalent results for two others, including one whose clusters were manually defined. Clusters produced by our framework corresponded closely to the known subfamilies and were more contrasting than those produced by ASMC. Additionally, for the families whose specificity-determining positions are known, those residues were among the ones our technique considered most important for differentiating a given group. When run on the crotonase and enolase SFLD superfamilies, the results agreed closely with this gold standard. The best results consistently involved multiple data types, confirming our hypothesis that similarities according to different knowledge domains may be used as evidence of functional similarity.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; the use of domain knowledge to detect subfamilies with different specificities within a family, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
Author Summary: Knowledge of protein functions is central to understanding life at the molecular level and has major biochemical and pharmaceutical implications. However, despite the best research efforts, a substantial and ever-increasing number of proteins predicted by genome sequencing projects still lack functional annotations. Computational methods are required to determine protein functions quickly and reliably, since experimental investigation is difficult and costly. Because the literature shows that combining various types of information is crucial for functionally annotating proteins, such methods must be able to integrate data from different sources, which may be scattered, non-standardized, incomplete, and noisy. Many protein families are composed of proteins with different folds and functions. In such cases, dividing the family into subtypes that share specific functions uncommon to the family as a whole may yield important information about the function and structure of a related protein of unknown function, as well as about the functional diversification the family acquired during evolution. The purpose of this work is to automatically detect isofunctional subfamilies in a protein family of unknown function and to identify the residues responsible for differentiation. We integrate data and then provide it to a clustering algorithm, which creates clusters of similar proteins that we found to correspond to same-specificity subfamilies.
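The integrate-then-cluster idea described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not from the paper: a fixed weighted average stands in for the genetic programming integration step, the spectral step follows one common normalized variant (top eigenvectors of the degree-normalized similarity matrix, then k-means), and the similarity values, weights, and all function names are hypothetical toy choices.

```python
import numpy as np

def integrate_similarities(matrices, weights):
    """Combine pairwise similarity matrices by weighted average.

    ASSUMPTION: a fixed weighted average replaces the genetic
    programming integration step, which is not reproduced here.
    """
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    combined = np.tensordot(weights / weights.sum(), stacked, axes=1)
    return (combined + combined.T) / 2.0  # enforce symmetry

def spectral_embedding(similarity, n_clusters):
    """Embed items using the top eigenvectors of D^-1/2 S D^-1/2."""
    degree = similarity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    normalized = d_inv_sqrt[:, None] * similarity * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(normalized)  # ascending eigenvalues
    top = eigenvectors[:, -n_clusters:]
    norms = np.linalg.norm(top, axis=1, keepdims=True)
    return top / np.maximum(norms, 1e-12)  # unit-length rows

def kmeans(points, n_clusters, n_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        distances = np.linalg.norm(
            points[:, None, :] - centers[None, :, :], axis=2
        )
        labels = np.argmin(distances, axis=1)
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k)
            else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_proteins(matrices, weights, n_clusters):
    """Integrate similarity evidence, then cluster spectrally."""
    combined = integrate_similarities(matrices, weights)
    return kmeans(spectral_embedding(combined, n_clusters), n_clusters)

def toy_similarity(within_a, within_b, between):
    """Hypothetical 6x6 similarity matrix with two groups of three."""
    sim = np.full((6, 6), between)
    sim[:3, :3] = within_a
    sim[3:, 3:] = within_b
    np.fill_diagonal(sim, 1.0)
    return sim

# Two made-up evidence types (e.g. sequence and structure similarity).
sequence_sim = toy_similarity(0.9, 0.8, 0.2)
structure_sim = toy_similarity(0.7, 0.9, 0.3)

labels = cluster_proteins([sequence_sim, structure_sim], [0.5, 0.5], 2)
print(labels)
```

On this toy input, proteins 0 to 2 and proteins 3 to 5 fall into separate clusters. With real data, the weights and the choice of evidence types would have to be learned or selected rather than fixed by hand, which is the role the description assigns to the data integration step.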
Database: OpenAIRE