Bayesian variable selection in cluster analysis

Author: Dimitrakopoulou, Vasiliki
Year of publication: 2023
DOI: 10.22024/unikent/01.02.94305
Description: Statistical analysis of high-dimensional data sets has attracted great interest in recent years, with important applications in disciplines such as medicine, neuroscience, pattern recognition, image analysis and many others. The vast number of available variables, in contrast to the limited sample size, often masks the cluster structure of the data. Often some variables do not help in distinguishing the different clusters; patterns across the sampled observations are thus usually confined to a small subset of variables. We are therefore interested in identifying the variables that best discriminate the sample while simultaneously recovering the actual cluster structure of the objects under study. With Markov chain Monte Carlo (MCMC) methodology being widely established, we investigate the performance of combined variable selection and clustering within the Bayesian framework. Motivated by the work of Tadesse et al. (2005), we identify the set of discriminating variables through a latent vector and formulate the clustering procedure within the finite mixture model methodology. Using Markov chains, we draw inference not just on the set of selected variables and the cluster allocations but also on the actual number of components, using the Reversible Jump MCMC sampler (Green, 1995) and a variation of the SAMS sampler of Dahl (2005). However, sensitivity to the hyperparameter settings of the covariance structure of the suggested model motivated our interest in an Empirical Bayes procedure to pre-specify the crucial hyperparameters. To further address the problem of hyperparameter sensitivity, we suggest several different covariance structures for the mixture components. Developing MATLAB code for all models introduced in this thesis, we apply and compare the various models on a set of simulated data, as well as on three real data sets: the iris, the crabs and the arthritis data sets.
Database: OpenAIRE