Unsupervised Grammar Induction for Revealing the Internal Structure of Protein Sequence Motifs

Authors: Witold Dyrka, Mateusz Gabor, Olgierd Unold
Year of publication: 2020
Subject:
Source: Artificial Intelligence in Medicine ISBN: 9783030591366
AIME
DOI: 10.1007/978-3-030-59137-3_27
Description: Protein sequence motifs are conserved amino acid patterns of biological significance. They are vital for annotating structural and functional features of proteins. Yet the computational methods commonly used for defining sequence motifs typically rely on simplified linear representations that neglect the higher-order structure of the motif. The purpose of this work is to create models of sequence motifs that take into account the internal structure of the modeled fragments. The ultimate goal is to provide the community with accurate and concise models of diverse collections of remotely related amino acid sequences that share structural features. The internal structure of amino acid sequences is modeled using a novel algorithm for unsupervised learning of weighted context-free grammars (WCFG). The proposed method learns a WCFG from both positive and negative samples, and the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. In comparison to existing approaches to learning CFGs, the new method generates more concise descriptors and provides good control of the trade-off between grammar size and specificity. The method is applied to the nicotinamide adenine dinucleotide phosphate binding site motif.
Database: OpenAIRE
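
Note: the description above refers to weighted context-free grammars and inside-outside estimation. As a rough illustration only, the sketch below computes the standard inside score (the summed weight of all parses of a sequence) for a WCFG in Chomsky normal form. It is not the authors' induction method or their Inside-Outside Contrastive Estimation algorithm. The function name `inside_score`, the nonterminals `S`, `X`, `Y`, and all rule weights are hypothetical and chosen for this example.

```python
from collections import defaultdict


def inside_score(sequence, lexical_rules, binary_rules, start="S"):
    """Summed weight of all parses of `sequence` under a WCFG in Chomsky normal form.

    lexical_rules maps (lhs, terminal) -> weight for rules lhs -> terminal.
    binary_rules maps (lhs, left, right) -> weight for rules lhs -> left right.
    """
    n = len(sequence)
    # chart[(i, j)][A] = summed weight of all derivations of sequence[i:j] from A
    chart = defaultdict(lambda: defaultdict(float))

    # Spans of length 1: apply lexical rules.
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), weight in lexical_rules.items():
            if terminal == symbol:
                chart[(i, i + 1)][lhs] += weight

    # Longer spans: combine two adjacent sub-spans with binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, left, right), weight in binary_rules.items():
                    left_score = chart[(i, k)].get(left, 0.0)
                    right_score = chart[(k, j)].get(right, 0.0)
                    if left_score and right_score:
                        chart[(i, j)][lhs] += weight * left_score * right_score

    return chart[(0, n)].get(start, 0.0)


# Hypothetical toy grammar over two amino acid letters (G, A); weights are made up.
LEXICAL = {("X", "G"): 1.0, ("Y", "A"): 0.6, ("Y", "G"): 0.4}
BINARY = {("S", "X", "S"): 0.5, ("S", "X", "Y"): 0.5}

if __name__ == "__main__":
    print(inside_score("GGA", LEXICAL, BINARY))
    print(inside_score("AG", LEXICAL, BINARY))
```

In a contrastive setting, scores of this kind would be computed for both positive and negative samples; how the paper combines them to estimate rule weights is not stated in this record.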