Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays
Author: | Rajiv Movva, Peyton Greenside, Surag Nair, Georgi K. Marinov, Avanti Shrikumar, Anshul Kundaje |
---|---|
Year of publication: | 2018 |
Subject: | RNA, Untranslated; Gene Expression; Regulatory Sequences, Nucleic Acid; Genome; Convolutional neural network; Biochemistry; Database and Informatics Methods; 0302 clinical medicine; Genes, Reporter; Gene expression; Regulation of gene expression; 0303 health sciences; Multidisciplinary; Chromosome Biology; High-Throughput Nucleotide Sequencing; Hep G2 Cells; Genomics; Chromatin; Regulatory sequence; Medicine; Biological Assay; Epigenetics; Sequence Analysis; Research Article; Computer and Information Sciences; Neural Networks; Sequence analysis; Bioinformatics; Science; Nucleotide Sequencing; Computational biology; Biology; Research and Analysis Methods; Polymorphism, Single Nucleotide; DNA sequencing; 03 medical and health sciences; Sequence Motif Analysis; Cell Line, Tumor; DNA-binding proteins; Genome-Wide Association Studies; Genetics; Humans; Gene Regulation; Allele; Molecular Biology Techniques; Sequencing Techniques; Gene; Transcription factor; Molecular Biology; Alleles; 030304 developmental biology; Genome, Human; Biology and Life Sciences; Computational Biology; Proteins; Human Genetics; DNA; Sequence Analysis, DNA; Cell Biology; Genome Analysis; Noncoding DNA; Regulatory Proteins; Neural Networks, Computer; K562 Cells; 030217 neurology & neurosurgery; Software; Transcription Factors; Neuroscience |
Source: | PLoS ONE, Vol 14, Iss 6, p e0218073 (2019) |
ISSN: | 1932-6203 |
Description: | The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced. |
Database: | OpenAIRE |
External link: |
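
The description reports a Spearman correlation between predicted and measured activity and mentions predicting the allelic effects of variants. As a rough illustration of those two quantities only, the sketch below computes a rank correlation in plain Python and scores a variant as the difference between predictions for the alternate and reference sequences. Everything here is an assumption for illustration: `gc_fraction` is a hypothetical stand-in predictor, not the MPRA-DragoNN model, and the alt-minus-ref sign convention is chosen arbitrarily rather than taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(predicted, measured):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def gc_fraction(seq):
    """Hypothetical toy predictor (NOT the paper's model): fraction of G/C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def allelic_effect(predict, ref_seq, alt_seq):
    """Assumed convention: predicted activity of alt minus that of ref."""
    return predict(alt_seq) - predict(ref_seq)
```

In practice `predict` would be replaced by a trained sequence-to-activity model, and the correlation would be computed over held-out constructs rather than toy lists.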