Description: |
Protein sequences encode the information necessary for function and folding, but accessing it is not straightforward. Although structure and function are tightly linked, predicting how specific amino acids contribute to these features remains difficult. Here, we developed PhISCO (Phenotype Inference from Sequence COmparisons), a simple algorithm that finds positions associated with any quantitative phenotype and predicts the phenotype values. Using a few hundred sequences from four different protein families, we built multiple sequence alignments and calculated per-position pairwise differences for both the sequences and the observed phenotypes. For adenylate kinase, we found 3 positions near the ligand-binding sites that are linked to optimal growth temperature. For microbial rhodopsins, we identified 10 positions close to the chromophore and the lipid-protein interface whose differences correlate with the maximal absorption wavelength. For myoglobins, we identified 3 positions that have been described as tightly linked to muscular myoglobin concentration. Finally, we identified positions associated with the inhibitory potency of 2 inhibitors of the HIV protease. Strong correlations exist at single positions, and the correlation improves when the most correlated positions are analyzed jointly. Notably, we predicted phenotypes using a simple linear model that links per-position divergences at the most correlated positions to differences in observed phenotypes. The diversity of the explored systems makes PhISCO valuable for finding sequence determinants of biological activity modulation and for predicting various functional features of uncharacterized members of a protein family. |
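
The core idea described above, correlating per-position pairwise sequence differences with pairwise phenotype differences, can be illustrated with a minimal sketch. This is not the PhISCO implementation: the 0/1 mismatch score, the use of Pearson correlation, the function names (`pairwise_differences`, `pearson`, `rank_positions`), and the toy sequences and phenotype values are all assumptions made for illustration. The linear prediction model is not sketched here.

```python
from itertools import combinations
from math import sqrt


def pairwise_differences(aligned_seqs, phenotypes):
    """For every pair of aligned sequences, record a per-position mismatch
    (assumed here to be 0 or 1) and the absolute phenotype difference."""
    n_pos = len(aligned_seqs[0])
    seq_diffs = [[] for _ in range(n_pos)]  # one list of pair scores per column
    pheno_diffs = []
    for i, j in combinations(range(len(aligned_seqs)), 2):
        pheno_diffs.append(abs(phenotypes[i] - phenotypes[j]))
        for p in range(n_pos):
            same = aligned_seqs[i][p] == aligned_seqs[j][p]
            seq_diffs[p].append(0.0 if same else 1.0)
    return seq_diffs, pheno_diffs


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_positions(aligned_seqs, phenotypes):
    """Return (correlation, position) tuples, most correlated first."""
    seq_diffs, pheno_diffs = pairwise_differences(aligned_seqs, phenotypes)
    scores = [(pearson(d, pheno_diffs), p) for p, d in enumerate(seq_diffs)]
    return sorted(scores, reverse=True)


# Hypothetical toy alignment: column 0 tracks the phenotype, the others do not.
seqs = ["AKLV", "AKLI", "GKLV", "GKMI"]
phenotypes = [10.0, 10.0, 30.0, 30.0]
ranked = rank_positions(seqs, phenotypes)
print(ranked[0])  # top-ranked (correlation, position) pair
```

In this toy alignment, column 0 is the only one whose mismatches follow the phenotype differences, so it ranks first. A real analysis would presumably use a substitution-aware distance rather than a plain mismatch, and would need to handle alignment gaps.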