Popis: |
We propose a reference feature extractor for DNA methylation data that is potentially applicable to other epigenomic data sources. It can be used in a trans-omics manner to bridge epigenomics and transcriptomics, and its internal latent space allows classification and regression problems to be solved across omics layers. DNA methylation is a type of epigenomic data that is altered by external factors, including changes in the environment, and it has multiple roles, including the regulation of gene expression. The goal of the reference feature extractor is to extract important features from DNA methylation data and encode them in a low-dimensional feature space. To achieve this, the model was trained on a pan-cancer dataset so that it sees a wide variety of data. Because of the low-dimensional encoding, downstream tasks can be solved with significantly fewer parameters. Current state-of-the-art methods can operate in a trans-omics setting, but they do not generalise to other settings [1--3]. For example, TDImpute [4] needed an extra decision-making model to complete the classification task and did not use the latent feature representation inferred inside the model. In contrast, the multi-layer perceptron used in this approach, called LDEncoder, has a low encoding dimension (512), which represents the high-dimensional DNA methylation data in a significantly lower-dimensional feature space. When a new classification or regression problem needs to be solved, this 512-dimensional encoding can serve as the input for transfer learning, which significantly reduces the time and computational resources required. In effect, transforming DNA methylation data into gene expression data (RNA-seq) through a bottleneck is what produces the lower-dimensional encoding of the data. 
In the same setting, we also evaluated various models and techniques inspired by successful approaches in computer vision. These included saving the model parameters at the best validation loss and CpG site sorting. We found some promising results, as shown in Table 1. We further evaluate the generalisability of the model through cancer/non-cancer prediction and breast cancer molecular subtype prediction. |
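
To make the parameter argument concrete, the following is a minimal sketch that counts the trainable parameters of a small downstream classification head when it is fed the raw methylation vector versus the 512-dimensional encoding. Only the encoding dimension of 512 comes from the description above; the number of CpG sites (`N_CPG`), the hidden width (`HIDDEN`), and the helper names are hypothetical values chosen for illustration, not the configuration used in this work.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total trainable parameters of an MLP with the given layer widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


N_CPG = 450_000      # hypothetical number of CpG sites per sample (assumption)
ENCODING_DIM = 512   # encoding dimension stated in the description
HIDDEN = 256         # hypothetical hidden width of the downstream head (assumption)
N_CLASSES = 2        # e.g. cancer vs non-cancer

# Downstream head trained directly on the raw methylation vector.
head_on_raw = mlp_params([N_CPG, HIDDEN, N_CLASSES])

# Same head trained on the 512-dimensional encoding of a frozen encoder.
head_on_encoding = mlp_params([ENCODING_DIM, HIDDEN, N_CLASSES])

reduction = head_on_raw / head_on_encoding

print(f"head on raw CpG input: {head_on_raw:,} trainable parameters")
print(f"head on 512-d encoding: {head_on_encoding:,} trainable parameters")
print(f"reduction factor: {reduction:.0f}x")
```

The count covers only the downstream head. The pretrained encoder still has its own parameters, but under the transfer-learning setup described above they would be reused rather than retrained for each new task.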