Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Authors: Rajiv Movva, Peyton Greenside, Surag Nair, Georgi K. Marinov, Avanti Shrikumar, Anshul Kundaje
Year of publication: 2018
Subjects:
RNA, Untranslated
Gene Expression
Regulatory Sequences, Nucleic Acid
Genome
Convolutional neural network
Biochemistry
Database and Informatics Methods
0302 clinical medicine
Genes, Reporter
Gene expression
Regulation of gene expression
0303 health sciences
Multidisciplinary
Chromosome Biology
High-Throughput Nucleotide Sequencing
Hep G2 Cells
Genomics
Chromatin
Regulatory sequence
Medicine
Biological Assay
Epigenetics
Sequence Analysis
Research Article
Computer and Information Sciences
Neural Networks
Sequence analysis
Bioinformatics
Science
Nucleotide Sequencing
Computational biology
Biology
Research and Analysis Methods
Polymorphism, Single Nucleotide
DNA sequencing
03 medical and health sciences
Sequence Motif Analysis
Cell Line, Tumor
DNA-binding proteins
Genome-Wide Association Studies
Genetics
Humans
Gene Regulation
Allele
Molecular Biology Techniques
Sequencing Techniques
Gene
Transcription factor
Molecular Biology
Alleles
030304 developmental biology
Genome, Human
Biology and Life Sciences
Computational Biology
Proteins
Human Genetics
DNA
Sequence Analysis, DNA
Cell Biology
Genome Analysis
Noncoding DNA
Regulatory Proteins
Neural Networks, Computer
K562 Cells
030217 neurology & neurosurgery
Software
Transcription Factors
Neuroscience
Source: PLoS ONE
PLoS ONE, Vol 14, Iss 6, p e0218073 (2019)
ISSN: 1932-6203
Description: The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
Database: OpenAIRE
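
Illustrative note: the description mentions three generic operations: encoding DNA for a sequence model, scoring a variant by comparing predictions for its two alleles, and evaluating predictions with a Spearman rank correlation. The following is a minimal sketch of those operations in plain Python, not the MPRA-DragoNN implementation. All function names are hypothetical, and `toy_predict` is a made-up stand-in for a trained model (it merely counts an arbitrary motif string) so that the example runs on its own.

```python
from typing import Callable, List, Sequence


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as rows of [A, C, G, T] indicators.

    Any other character (e.g. N) becomes an all-zero row.
    """
    alphabet = "ACGT"
    return [[int(base == letter) for letter in alphabet] for base in seq.upper()]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def allelic_effect(predict: Callable[[str], float], ref_seq: str, alt_seq: str) -> float:
    """Score a variant as predicted activity of alt minus that of ref."""
    return predict(alt_seq) - predict(ref_seq)


def toy_predict(seq: str) -> float:
    """Hypothetical placeholder model: counts an arbitrary motif string.

    A real workflow would call a trained network on one_hot(seq) instead.
    """
    return float(seq.upper().count("GATA"))


if __name__ == "__main__":
    print(one_hot("ACGN"))
    print(allelic_effect(toy_predict, "AAGATCAA", "AAGATAAA"))
    print(spearman([0.1, 0.4, 0.3], [1.0, 3.0, 2.0]))
```

In practice the predictor passed to `allelic_effect` would be a trained model applied to the encoded reference and alternate sequences, and `spearman` would compare its predictions against measured MPRA activity on held-out constructs.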