MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Authors: Jarkko Toivonen, Esko Ukkonen, Jussi Taipale, Pratyush Kumar Das
Contributors: Department of Computer Science, University of Helsinki, ATG - Applied Tumor Genomics, Research Programs Unit, Jussi Taipale / Principal Investigator
Year of publication: 2019
Subjects:
Statistics and Probability
Orientation (graph theory)
Markov model
SEQUENCE
Biochemistry
03 medical and health sciences
chemistry.chemical_compound
0302 clinical medicine
EM ALGORITHM
Position (vector)
Expectation–maximization algorithm
Order (group theory)
Position-Specific Scoring Matrices
TRANSCRIPTION FACTOR
POSITION
Nucleotide Motifs
SPECIFICITY
Molecular Biology
030304 developmental biology
Mathematics
11832 Microbiology and virology
SITES
0303 health sciences
Sequence
Binding Sites
IDENTIFICATION
RECOGNITION
PROTEIN-DNA INTERACTIONS
113 Computer and information sciences
Mixture model
Original Papers
Computer Science Applications
Computational Mathematics
Monomer
Computational Theory and Mathematics
chemistry
1182 Biochemistry, cell and molecular biology
Biological system
Sequence Analysis
030217 neurology & neurosurgery
Algorithms
Software
Protein Binding
Transcription Factors
Source: Bioinformatics
ISSN: 1367-4811
Description: Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominant model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing evidence that higher-order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs), perform better. Unlike PPMs, ADMs can model dependencies between adjacent nucleotides. A modeling technique and software tool that would estimate such models simultaneously for both monomers and their dimers have been missing. Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm, MODER2, for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Availability and implementation: Software implementation is available from https://github.com/jttoivon/moder2. Supplementary information: Supplementary data are available at Bioinformatics online.
Database: OpenAIRE
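
Illustrative note: the abstract contrasts PPMs, which treat motif positions as independent, with ADMs, which are first-order Markov models over adjacent nucleotides. The sketch below is not taken from MODER2; the function names (`ppm_prob`, `adm_prob`) and the toy parameter values are assumptions made up for illustration only. It shows how a two-position motif whose sites are always "AC" or "GT" is described by each model type.

```python
# Hypothetical toy example; not the MODER2 implementation or its data format.
BASES = "ACGT"

def ppm_prob(seq, ppm):
    """Probability of seq under a PPM: each position is scored independently."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= ppm[i][base]
    return p

def adm_prob(seq, initial, transitions):
    """Probability of seq under a first-order Markov model (ADM).

    initial[b]           : P(first base = b)
    transitions[i][a][b] : P(base at position i+1 = b | base at position i = a)
    """
    p = initial[seq[0]]
    for i in range(len(seq) - 1):
        p *= transitions[i][seq[i]][seq[i + 1]]
    return p

# Assumed toy motif: sites are "AC" or "GT" with equal frequency.
ppm = [
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5},
]
initial = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
uniform = {b: 0.25 for b in BASES}  # rows for first bases that never occur
transitions = [{
    "A": {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    "C": dict(uniform),
    "G": {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    "T": dict(uniform),
}]

for site in ("AC", "GT", "AT"):
    print(site, ppm_prob(site, ppm), adm_prob(site, initial, transitions))
```

In this toy setting the PPM assigns the same probability to "AT" as to the true sites "AC" and "GT", because it cannot see that A is always followed by C. The ADM assigns "AT" probability zero and concentrates all mass on the two observed sites.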