Additional file 1 of Probing transcription factor combinatorics in different promoter classes and in enhancers

Author: Vandel, Jimmy; Cassan, Océane; Lèbre, Sophie; Lecellier, Charles-Henri; Bréhélin, Laurent
Year of publication: 2019
DOI: 10.6084/m9.figshare.7664789
Description:
Figure S1. Comparison of the accuracy of the different approaches on the 409 experiments in the non expression-controlled challenge for promoters. (a) TRAP vs. Best hit, (b) DNA shape vs. Best hit, (c) TFcoop vs. Best hit, (d) TFcoop vs. DNA shape.
Figure S2. ROC curves obtained on mRNA promoters for the 409 ChIP-seq experiments (non expression-controlled challenge).
Figure S3. Link between the number of training sequences (x-axis) and model AUCs (y-axis).
Figure S4. Comparison of AUCs achieved when using nucleotide and dinucleotide frequencies only (x-axis) and when using nucleotide, di-, tri-, and quadri-nucleotide frequencies (y-axis).
Figure S5. Comparison of AUCs achieved with the JASPAR (complete), JASPAR (non-redundant), CisBP and HOCOMOCO databases of PWMs.
Figure S6. Comparison of AUCs achieved on ENCODE and non-ENCODE data. Each column corresponds to a TFcoop model learned on a specific ENCODE ChIP-seq experiment. Black points correspond to the AUCs achieved when using these models on other ENCODE ChIP-seq data targeting the same TF, while red triangles correspond to the AUCs achieved when using these models on non-ENCODE ChIP-seq data targeting the same TF. Overall, AUCs achieved on non-ENCODE data are in the range of the AUCs achieved on ENCODE data.
Figure S7. Enrichment of three different PWM classes in the selected PWMs of promoter (top) and enhancer (bottom) models. For these analyses, PWMs were ranked according to the number of times they were selected in promoter and enhancer models, and the GSEA method was applied to identify over-represented PWM classes among the most used PWMs.
Figure S8. Mean rank of the selected dinucleotides in promoter models according to the dinucleotide composition of the corresponding target PWM. For each model, the 16 dinucleotide variables were ordered according to their frequency in the target PWM. Then, the rank of each dinucleotide was averaged over all models. A high mean rank thus indicates that, when selected, the dinucleotide was also frequent in the target PWM.
Figure S9. Enrichment of pioneer factors among selected PWMs for promoters (a) and enhancers (b). For these analyses, PWMs were ranked according to the number of times they were selected in promoter and enhancer models, and the GSEA method was applied to compute the enrichment of pioneers among the most used PWMs.
Figure S10. (Top): Heatmap of the selected variables in the 409 logistic models learned on the mRNA promoters in the expression-controlled challenge. Each column corresponds to one of the logistic models, while the rows represent the variables used in the models (PWM affinity scores and mono- and di-nucleotide frequencies). Models (columns) have been partitioned into 5 different classes (represented by different colors on the top line) by a k-means algorithm. The number of classes (5) was chosen empirically because it offers a good trade-off between modelling and complexity. (Bottom): Trade-off between modelling and complexity. This figure reports the average distance (y-axis) between points in the same class, according to the number of classes of the classification (x-axis). Up to 5 classes, the average distance between points decreases substantially, while beyond 5 classes the decrease is smaller and almost linear.
Figure S11. The 30 most common variables in the five classes of models represented in Additional file 1: Figure S10. Each bar represents the proportion of models (in the class) which use the considered variable. Dark bars represent TFs classified as “pioneer factors” in reference [9], while pale bars correspond to TFs classified as “settler” or “migrant” in the same publication. Plain bars correspond to non-classified TFs as well as to mono- or di-nucleotides.
Figure S12. AT rate distributions of selected PWMs in mRNA promoter models (with β>0). For each cluster we keep one model per target PWM to avoid bias due to over-representation of some PWMs. As cluster 4 is only composed of CTCF models, the distribution associated with this cluster is represented by a vertical segment on the x-axis.
Figure S13. Distribution of methylation binding influence in selected PWMs of mRNA promoter models. We kept one model for each target PWM to avoid bias due to over-representation of the same PWM in certain classes. The distribution of all PWMs associated with a methylation class originally defined in reference [51] (190 of the 520 non-redundant PWMs) is shown in grey. “Little” designates TFs that recognize CpG-containing sequences but whose binding is little affected by methylation of the CpG. “MethylMinus” refers to TFs that do not bind, or bind more weakly, to methylated versions of their recognition sequences. Conversely, TFs that prefer to bind to methylated sequences over the corresponding unmethylated sequence belong to the “MethylPlus” class. See [51] for further details.
Figure S14. Distribution of the number of mRNA and miRNA promoters overlapping a ChIP-seq peak in the 409 ChIP-seq experiments.
Figure S15. Promoter models are interchangeable. Left: AUC comparison of models learned and applied on lncRNAs and of models learned on mRNAs and applied on lncRNAs. Right: AUC comparison of models learned and applied on pri-miRNAs and of models learned on mRNAs and applied on pri-miRNAs.
Figure S16. Comparison of the accuracy of the different approaches on the 409 experiments in the non expression-controlled challenge for enhancers. (a) TRAP vs. Best hit, (b) DNA shape vs. Best hit, (c) TFcoop vs. Best hit, (d) TFcoop vs. DNA shape.
Figure S17. Distribution of Gini coefficients computed for 53,220 gene promoters and 65,423 FANTOM5 enhancers on 1827 and 1897 samples, respectively. The Gini coefficient is a measure of statistical dispersion which can be used to measure gene ubiquity: value 0 represents genes expressed in all samples, while value 1 represents genes expressed in only one sample.
Figure S18. Heatmap of the selected variables in the 409 logistic models learned on the mRNA enhancers in the expression-controlled challenge. Each column corresponds to one of the logistic models, while the rows represent the variables used in the models (PWM affinity scores and mono- and di-nucleotide frequencies). Models (columns) have been partitioned into 6 different classes (represented by different colors on the top line) by a k-means algorithm.
Figure S19. The 30 most common variables in the six classes of models represented in Additional file 1: Figure S18. Each bar represents the proportion of models (in the class) which use the considered variable. Dark bars represent TFs classified as “pioneer factors” in reference [9], while pale bars correspond to TFs classified as “settler” or “migrant” in the same publication. Plain bars correspond to non-classified TFs as well as to mono- or di-nucleotides.
Figure S20. AT rate distributions of selected PWMs in enhancer models (with β>0). For each cluster we keep one model per target PWM to avoid bias due to over-representation of some PWMs.
Figure S21. Dotplot of the AUCs computed on mRNA promoters and on enhancers for the same ChIP-seq experiment.
Table S1. Variables that are selected more often in the non-controlled models than in the corresponding expression-controlled models in promoters (left) and enhancers (right). # ¬contr.: number of non-controlled models that involve each variable. # contr.: number of corresponding expression-controlled models that also involve the variable. P-values were computed by hypergeometric tests.
Table S2. Variables that are differentially selected in promoters and enhancers. (Left) Variables selected more often in promoter models than in enhancer models. (Right) Variables selected more often in enhancer models than in promoter models. # promo: number of promoter models involving this variable. # enhancer: number of enhancer models involving this variable. P-values were computed by chi-squared tests.
(PDF 4198 kb)
Database: OpenAIRE