Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank.

Authors: Kondo R; Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan., Kasahara K; College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan., Takahashi T; College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.
Language: English
Source: Biophysics and physicobiology [Biophys Physicobiol] 2022 Feb 08; Vol. 19, pp. 1-12. Date of Electronic Publication: 2022 Feb 08 (Print Publication: 2022).
DOI: 10.2142/biophysico.bppb-v19.0002
Abstract: Elucidating the principles of sequence-structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, one type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on their environment. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure has not been fully quantified for each subsequence. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-grams (4 ≤ N ≤ 9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure, in terms of secondary structure and accessible surface area, are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to the lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus this technique helps to extract knowledge from the limited number of samples in the PDB.
(2022 THE BIOPHYSICAL SOCIETY OF JAPAN.)
Database: MEDLINE
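
Illustrative note: the abstract does not give the formula for the information quantity, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each N-gram occurrence is labeled with the secondary-structure state of its central residue, and that the information quantity is the Kullback-Leibler divergence (in bits) between the state distribution of that N-gram and the background distribution over all N-gram occurrences. The function names `ngram_state_counts` and `information_quantity`, the H/E state letters, and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def ngram_state_counts(records, n):
    """Count secondary-structure states for every N-gram.

    records: iterable of (sequence, ss_string) pairs of equal length.
    Assumption: each occurrence is labeled by the state of its central residue.
    """
    counts = defaultdict(Counter)
    for seq, ss in records:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]][ss[i + n // 2]] += 1
    return counts


def information_quantity(counts):
    """KL divergence (bits) of each N-gram's state distribution from the
    background distribution pooled over all N-gram occurrences.

    Assumption: this stands in for the paper's information quantity.
    """
    background = Counter()
    for state_counts in counts.values():
        background.update(state_counts)
    total = sum(background.values())

    info = {}
    for gram, state_counts in counts.items():
        n_obs = sum(state_counts.values())
        info[gram] = sum(
            (k / n_obs) * log2((k / n_obs) / (background[s] / total))
            for s, k in state_counts.items()
        )
    return info


# Toy data (not from the PDB): H = helix, E = strand.
toy = [("AAAA", "HHHH"), ("GGGG", "EEEE"), ("ACDE", "HHHH"), ("ACDE", "EEEE")]
iq = information_quantity(ngram_state_counts(toy, 4))
```

In this toy data, an N-gram that always appears in one state scores high, while one that appears equally in both states (a chameleon-like case) scores zero. With only a few observations per N-gram this naive estimate is unreliable; how the study handles small sample sizes is not stated in the abstract.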