Description: |
Elucidating the relationship between the sequences of non-coding regulatory elements and their target genes is key to understanding gene regulation and its variation between plant species and ecotypes. In this study, we developed deep learning models that link gene sequence data with mRNA copy number for the plant species Arabidopsis thaliana, Sorghum bicolor, Solanum lycopersicum and Zea mays, and predicted the regulatory effect of gene sequence variation. Our models achieved over 80% accuracy in the species-specific and multi-species prediction tasks and enabled predictive feature selection within the input regulatory sequences. Saliency scores of the model highlighted a set of expression-predictive sequence features and the profound importance of the UTRs in determining the level of gene expression. The identified sequence features exhibited remarkable conservation across plant species and achieved more than 70% accuracy in cross-species expression prediction. We demonstrated the application of our model on 14 newly assembled tomato genomes, in which we annotated the effect of structural genetic variation on gene expression. Finally, we showed that by accurately predicting differences in the expression of biosynthetic enzymes and their individual homologs, the model highlights known metabolic differences between related genotypes. This was demonstrated for biosynthetic pathways of stress-related compounds in Solanum lycopersicum and its wild drought-resistant relative Solanum pennellii. |