Decoding the cis-regulatory information of enhancer sequences

Author:	Lucas Carvalho Pereira De Almeida, Bernardo
Language:	English
Publication year:	2023
DOI:	10.25365/thesis.73741
Description:	The instructions for when and where each of the approximately 20,000 human protein-coding genes is to be expressed are encoded in the DNA sequences of transcriptional enhancers. Enhancers are genomic non-coding cis-regulatory elements that act as on-off switches of gene transcription. The vast majority of disease-associated mutations fall into the non-coding part of the genome and appear to be particularly enriched in enhancers, where they affect gene regulation. 
However, despite the importance of enhancers for development and disease, deciphering the link between the sequence of an enhancer and its regulatory activity has remained one of the greatest challenges in biology, and neither predicting enhancer activity nor designing synthetic enhancers with specific properties has been achieved. The aim of this PhD thesis was to achieve a better understanding of the cis-regulatory information encoded in enhancer sequences by integrating deep learning algorithms with high-throughput enhancer testing and systematic enhancer sequence perturbation assays, using Drosophila melanogaster S2 cells as the main model system. First, I developed a deep learning model, DeepSTARR, that predicts the enhancer activity of any DNA sequence, identifies its critical nucleotides, and enables the de novo design of synthetic enhancers. I applied this approach to Drosophila S2 cells and trained DeepSTARR to learn their enhancer sequence code with increased accuracy. In a second step, I interpreted the model and revealed long-sought sequence rules for enhancers, including the importance of motif-flanking nucleotides and of the distances between transcription factor motifs. We validated these rules experimentally and demonstrated their conservation in human enhancers. Finally, we also designed and functionally validated synthetic enhancers with desired activities, not only demonstrating the validity of the model and its rules but also illustrating the power of such approaches for synthetic biology. To further understand the rules of enhancer sequence syntax, we designed a large-scale enhancer mutagenesis project. The resulting changes in enhancer activity validated the predictive sequence features of DeepSTARR and revealed that enhancers display constrained sequence flexibility: only a specific but still diverse set of sequences and TF motifs can function at a given position. 
This activity of motifs at specific positions is strongly determined by the enhancer sequence context, namely the flanking sequence, the presence and diversity of other motif types, and the distance between motifs. Altogether, my work could provide the basis for current and future efforts to understand the regulatory information encoded in the human genome, predict the impact of genomic variation on function and disease, and engineer synthetic enhancers for biotechnological applications, especially gene therapy.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::d7ff4077e34903025543b71598bfba54