Assessing the Use of Secondary Structure Fingerprints and Deep Learning to Classify RNA Sequences

Authors: Marcel Turcotte, Kevin Sutanto
Year of publication: 2020
Subject:
Source: BIBM
DOI: 10.1109/bibm49941.2020.9313183
Description: Non-coding RNAs (ncRNAs) are RNA molecules that do not code for protein, but take part in biological processes, including gene expression. Interestingly, like proteins, they can fold into complex structures to perform their wide array of biological functions. Since the folded structure of an ncRNA may be critical to its function, many studies have attempted to exploit structural data to infer information, often using machine learning techniques. For instance, predicted secondary structures have been used as input features to various machine learning techniques in order to classify RNA sequences. However, it is known that a strand of RNA can fold into more than one possible structure, and some strands even form different structures in vivo and in vitro. Furthermore, ncRNAs often function as RNA-protein complexes, which can affect structure. We therefore hypothesized that using a single predicted secondary structure for a single sequence may discard important information, which may result in poorer classification accuracy. To investigate this hypothesis, we propose the use of secondary structure fingerprints as features for machine learning applications, and report on a preliminary evaluation of this approach. The fingerprints comprise two categories: a higher-level (topological) representation derived from RNA-As-Graphs (RAG), and free energy fingerprints based on a novel curated repertoire of small RNA motifs. We have also evaluated our deep learning architecture with k-mers as features, alone and combined with secondary structure fingerprints, to see whether secondary structure or nucleotide composition is more useful in RNA classification, and whether the two feature types complement each other. The dataset, trained models, and supplemental material of this study are available at https://www.site.uottawa.ca/turcotte/bibm2020.
Database: OpenAIRE