Computational interpretation of disease-causing, structural, and non-coding human genetic variants

Author: Kleinert, Philip
Year of publication: 2022
Subject:
DOI: 10.17169/refubium-36353
Description: While the first version of the human genome sequence was completed two decades ago, the understanding of many genomic variants remains elusive. Novel insights and technological advances improve the power to interpret genetic alterations in the genome. However, clinical applications lag behind basic research because the knowledge and tools that could benefit therapeutic outcomes and patients are not readily accessible. In this thesis, I help improve the interpretation of human genetic variants and increase the accessibility of these tools using three independent approaches. To give researchers and clinicians insight into and access to variant interpretation, I develop a tool to process and analyze targeted sequencing of genomic regions for screening patient cohorts, using the example of an established hemophilia A & B molecular inversion probe (MIP) design from the “My Life, Our Future” initiative. In a user-friendly HTML report, “hemoMIPs” summarizes covered, incomplete, or missing regions, as well as called variants and their predicted effects. HemoMIPs is published and available as an open-source tool on GitHub. In a second approach, I look at genomic structural variants (SVs) and estimate their effect on human health and disease using machine learning. Models are trained on human- and chimpanzee-derived SVs contrasted with matched simulated variants, an approach that has proven powerful for short sequence variants. “CADD-SV” computes summary statistics over diverse variant annotations and uses random forest models to prioritize functional SVs. The resulting CADD-SV scores correlate with known pathogenic, rare population, and somatic cancer variants. This approach is published and available as an online scoring service as well as open-source software on GitHub. The interpretation of non-coding variants in particular lags behind that of coding variants. In my third approach, I focus on non-coding variants in binding sites of a widely studied DNA-binding protein (CTCF).
Here, I develop a workflow to identify human-specific gained or lost CTCF binding sites using great ape and human datasets. Variants are prioritized for their impact on 3D genome architecture using a comprehensive set of annotations. Candidates are enriched in genomic regions mediating brain development. Further, independent experimental validation using chimpanzee, orangutan, and human NPCs and organoids shows high overlap with this computational approach.
Database: OpenAIRE