Popis: |
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative in discriminating between clusters, and what the optimal number of clusters is. In addition, few methods exist for unsupervised clustering of bulk RNA-seq samples, and no method exists that can do so while simultaneously adjusting for between-sample global normalization factors, accounting for potential confounding variables, and selecting cluster-discriminatory genes. In Chapter 2, we present FSCseq (Feature Selection and Clustering of RNA-seq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and employs a quadratic penalty method with a SCAD penalty. The maximization is done by a penalized EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership. The field of deep learning has also boomed in popularity in recent years, fueled initially by its performance in the classification and manipulation of image data, and, more recently, in areas of public health, medicine, and biology. However, missing data are very common in these latter areas and involve more complicated mechanisms of missingness than in image data. While a rich statistical literature exists regarding the characterization and treatment of missing data in traditional statistical models, it is unclear how such methods may extend to deep learning methods.
In Chapter 3, we present NIMIWAE (Non-Ignorably Missing Importance Weighted AutoEncoder), an unsupervised learning algorithm that provides a formal treatment of missing data in the context of Importance Weighted Autoencoders (IWAEs), an unsupervised Bayesian deep learning architecture, in order to perform single and multiple imputation of missing data. We review existing methods that handle missingness up to the missing at random (MAR) setting, and propose methods to handle the more difficult missing not at random (MNAR) scenario. We show that this extension is critical to ensure the performance of data imputation, as well as downstream coefficient estimation. We utilize simulation examples to illustrate the impact of missingness on such tasks, and compare the performance of several proposed methods in handling missing data. We apply our proposed methods to a large electronic healthcare record dataset, and illustrate their utility through a qualitative look at the downstream fitted models after imputation. Finally, in Chapter 4, we present dlglm (deeply-learned generalized linear model), a supervised learning algorithm that extends the missing data methods from Chapter 3 directly to supervised learning tasks such as classification and regression. We show that dlglm can be trained in the presence of missing data in both the predictors and the response, and under the missing completely at random (MCAR), MAR, and MNAR missing data settings. We also demonstrate that the trained dlglm model can directly predict the response for partially-observed samples in the prediction or test set, drawing from the variational posterior distribution of the missing values conditional on the observed values, which is learned during model training. We utilize statistical simulation and real-world datasets to show the impact of our method in increasing the accuracy of coefficient estimation and prediction under different mechanisms of missingness. |