What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Authors: Penzar DD; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia.; Department of Medical and Biological Physics, Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia., Zinkevich AO; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.; Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia., Vorontsov IE; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia., Sitnik VV; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia., Favorov AV; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.; Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University School of Medicine, Baltimore, MD, United States., Makeev VJ; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.; Department of Medical and Biological Physics, Moscow Institute of Physics and Technology (State University), Dolgoprudny, Russia.; Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia., Kulakovskiy IV; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.; Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia.; Institute of Mathematical Problems of Biology RAS - the Branch of Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, Pushchino, Russia.
Language: English
Source: Frontiers in genetics [Front Genet] 2019 Oct 31; Vol. 10, pp. 1078. Date of Electronic Publication: 2019 Oct 31 (Print Publication: 2019).
DOI: 10.3389/fgene.2019.01078
Abstract: Many problems of modern genetics and functional genomics require assessing the functional effects of sequence variants, including changes in gene expression. Machine learning is considered a promising approach to this task, but its practical application remains a challenge because the training data are insufficient in volume and diversity. A valuable source of such data is the saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore computational prediction of the effects of individual single-nucleotide variants on gene transcription measured in massively parallel reporter assays, based on data from the recent "Regulation Saturation" Critical Assessment of Genome Interpretation (CAGI) challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. In particular, training on sequence segments located next to the validation data results in "information leakage" caused by the local context. This information leakage makes it possible to reproduce the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even to obtain notably better-than-random predictions using irrelevant genomic regions. Validation scenarios that prevent such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future, with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
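The contrast between the two validation scenarios described in the abstract can be sketched in a few lines of Python. This is a toy illustration only, not the authors' code: the region names (`regA`, `regB`, `regC`), the 100 positions per region, and the helper functions are all hypothetical. It shows that a random split places variants from the same regulatory region on both sides of the train/validation boundary, whereas holding out a whole region does not.

```python
import random

def random_split(variants, test_fraction=0.25, seed=0):
    """Shuffle all variants together; neighbors from one region can land on both sides."""
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def leave_region_out_split(variants, held_out_region):
    """Keep every variant of one region out of training entirely."""
    train = [v for v in variants if v[0] != held_out_region]
    test = [v for v in variants if v[0] == held_out_region]
    return train, test

def shared_regions(train, test):
    """Regions contributing variants to both sets (a possible source of local-context leakage)."""
    return {region for region, _ in train} & {region for region, _ in test}

# Toy data: (region, position) pairs; names and sizes are hypothetical.
variants = [(region, pos) for region in ("regA", "regB", "regC") for pos in range(100)]

train_r, test_r = random_split(variants)
train_g, test_g = leave_region_out_split(variants, "regC")

print("random split, shared regions:", sorted(shared_regions(train_r, test_r)))
print("region held out, shared regions:", sorted(shared_regions(train_g, test_g)))
```

With the region-wise split, the set of shared regions is empty by construction, which corresponds to the stricter scenario in which a regulatory region is entirely excluded from training.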
(Copyright © 2019 Penzar, Zinkevich, Vorontsov, Sitnik, Favorov, Makeev and Kulakovskiy.)
Database: MEDLINE