Bayesian Variable Selection for Probit Mixed Models Applied to Gene Selection

Author: Meili Baragatti
Contributors: Institut de mathématiques de Luminy (IML), Centre National de la Recherche Scientifique (CNRS)-Université de la Méditerranée - Aix-Marseille 2, Université de la Méditerranée - Aix-Marseille 2-Centre National de la Recherche Scientifique (CNRS)
Publication year: 2011
Subject:
Statistics and Probability
Mixed model
FOS: Computer and information sciences
Individual gene
Computer science
Probit
grouping technique (or blocking technique)
Latent variable
computer.software_genre
Bayesian inference
01 natural sciences
Quantitative Biology - Quantitative Methods
Statistics - Applications
Methodology (stat.ME)
010104 statistics & probability
03 medical and health sciences
62J07
probit mixed regression model
random effects
Applications (stat.AP)
0101 mathematics
Statistics - Methodology
Quantitative Methods (q-bio.QM)
030304 developmental biology
Bayesian variable selection
0303 health sciences
[STAT.AP]Statistics [stat]/Applications [stat.AP]
Metropolis-within-Gibbs algorithm
Applied Mathematics
92D10
Random effects model
Gene selection
62-04
62P10
FOS: Biological sciences
62J12
Data mining
62F15
computer
[STAT.ME]Statistics [stat]/Methodology [stat.ME]
Source: Bayesian Analysis
Bayesian Analysis, International Society for Bayesian Analysis, 2011, 6 (2), pp.209-230. ⟨10.1214/11-BA607⟩
Bayesian Anal. 6, no. 2 (2011), 209-229
Bayesian Analysis, 2011, 6 (2), pp.209-230. ⟨10.1214/11-BA607⟩
ISSN: 1936-0975
1931-6690
DOI: 10.48550/arxiv.1101.4577
Description: In computational biology, gene expression datasets are characterized by very few individual samples compared to a large number of measurements per sample. Thus, it is appealing to merge these datasets in order to increase the number of observations and diversify the data, allowing a more reliable selection of genes relevant to the biological problem. Besides, the increased size of a merged dataset facilitates its re-splitting into training and validation sets. Merging necessitates the introduction of the dataset as a random effect. In this context, extending the work of Lee et al. (2003), a method is proposed to select relevant variables among tens of thousands in a probit mixed regression model, considered as part of a larger hierarchical Bayesian model. Latent variables are used to identify subsets of selected variables, and the grouping (or blocking) technique of Liu (1994) is combined with a Metropolis-within-Gibbs algorithm (Robert and Casella 2004). The method is applied to a merged dataset made of three individual gene expression datasets, in which tens of thousands of measurements are available for each of several hundred human breast cancer samples. Even for this large dataset comprising around 20,000 predictors, the method is shown to be efficient and feasible. As an illustration, it is used to select the most important genes that characterize the estrogen receptor status of patients with breast cancer.
Database: OpenAIRE