Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors.

Authors: Vorontsov IE; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.; Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia., Kozin I; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia., Abramov S; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.; Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA., Boytsov A; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.; Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA., Jolma A; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Albu M; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Ambrosini G; École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland., Faltejskova K; Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 160 00 Praha 6, Czech Republic.; Computer Science Institute, Faculty of Mathematics and Physics, Charles University, 118 00 Praha 1, Czech Republic., Gralak AJ; Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland.; Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland., Gryzunov N; Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia.; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia., Inukai S; Chugai Pharmaceutical Co., Ltd, Tokyo, 103-8324, Japan., Kolmykov S; Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia., Kravchenko P; Max Planck Institute of Biochemistry, 82152, Planegg, Germany., Kribelbauer-Swietek JF; Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland.; Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland., Laverty KU; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Nozdrin V; Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia.; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia., Patel ZM; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Penzar D; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia., Plescher ML; Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany., Pour SE; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Razavi R; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Yang AWH; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Yevshin I; Biosoft.Ru LLC, 630058, Novosibirsk, Russia., Zinkevich A; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia., Weirauch MT; Cincinnati Children's Hospital, Cincinnati, OH 45229, USA., Bucher P; Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland., Deplancke B; Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland.; Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland., Fornes O; Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada., Grau J; Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany., Grosse I; Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany., Kolpakov FA; Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia.; Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090, Novosibirsk, Russia., Makeev VJ; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.; Moscow Center for Advanced Studies, 123592, Moscow, Russia., Hughes TR; Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada., Kulakovskiy IV; Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia.; Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia.; Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia.
Language: English
Source: BioRxiv : the preprint server for biology [bioRxiv] 2024 Nov 13. Date of Electronic Publication: 2024 Nov 13.
DOI: 10.1101/2024.11.11.619379
Abstract: A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten different DNA motif discovery tools to generate PWMs from the "Codebook" data comprised of 4,237 experiments from five different platforms profiling the DNA-binding specificity of 394 human proteins, focusing on understudied transcription factors of different structural families. For many of the proteins, there was no prior knowledge of a genuine motif. By benchmarking-supported human curation, we constructed an approved subset of experiments comprising about 30% of all experiments and 50% of tested TFs which displayed consistent motifs across platforms and replicates. We present the Codebook Motif Explorer (https://mex.autosome.org), a detailed online catalog of DNA motifs, including the top-ranked PWMs, and the underlying source and benchmarking data. We demonstrate that in the case of high-quality experimental data, most of the popular motif discovery tools detect valid motifs and generate PWMs, which perform well both on genomic and synthetic data. Yet, for each of the algorithms, there were problematic combinations of proteins and platforms, and the basic motif properties such as nucleotide composition and information content offered little help in detecting such pitfalls. By combining multiple PWMs in decision trees, we demonstrate how our setup can be readily adapted to train and test binding specificity models more complex than PWMs. Overall, our study provides a rich motif catalog as a solid baseline for advanced models and highlights the power of the multi-platform multi-tool approach for reliable mapping of DNA binding specificities.
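Note: For readers unfamiliar with PWMs, the following is a minimal illustrative sketch of how a PWM is commonly derived from base counts and used to score a sequence. It is not taken from the preprint and does not reproduce any of the benchmarked tools; the function names, the toy count matrix, the pseudocount of 1, and the uniform background of 0.25 are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights.

    The pseudocount and uniform background are illustrative assumptions.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (score, offset) of the best-scoring window on the given strand.

    Only A, C, G, T are supported; other symbols raise KeyError.
    """
    width = len(pwm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col[base] for col, base in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

# Hypothetical 4-bp count matrix favouring the site ACGT.
toy_counts = [
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]
pwm = counts_to_pwm(toy_counts)
score, offset = best_hit(pwm, "TTGACGTCC")
print(f"best window starts at offset {offset} with log-odds score {score:.2f}")
```

In this sketch the best window is the one matching the consensus; real applications would typically also scan the reverse complement and calibrate score thresholds, which are omitted here.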
Competing Interests: O.F. is employed by Roche.
Database: MEDLINE