Breast Cancer Microarray Gene-Expression Data Sets Integration and Analysis

Author: Hsin-Chieh Yao, 姚欣潔
Publication year: 2008
Document type: thesis (學位論文)
Description: 96
Clustering failed because a different hybridization buffer was used for a second set of samples, so the experiments were repeated on the same samples using the original buffer. As a result, we have two sets of gene-expression data for the same 36 breast cancer samples. This heterogeneity complicates data analysis unnecessarily and, worse, produces false classifications in clustering. However, the repetition also provides a stringent test of data-treatment methods for removing buffer effects and, ultimately, a useful approach to data integration. We propose subgroup standardization to compensate for the buffer effect in microarray experiments; it is applied immediately after the normalization step. Because replicate microarray experiments are available for all 36 samples, the percentage of pair-wise matches among the 36 samples under hierarchical clustering can be used to evaluate the different approaches. With subgroup standardization, the matching rate improves by 94%, 31%, and 8% for Lowess, Median Rank Scores (MRS), and quantile normalization, respectively. The proposed subgroup standardization thus enhances data integration for microarray data regardless of the normalization method. The results are validated by replicate experiments on the same samples using different buffers on the same platform. Measured by pair-wise matching from hierarchical clustering, quantile normalization performs better than MRS, and Lowess performs worst; all three, however, are further improved by subgroup standardization. Going one step further, we classify patients into estrogen receptor (ER) positive and ER negative groups using each normalization method with and without subgroup standardization. We follow the TSP classifier procedure to choose candidate genes related to ER status and apply simulated annealing to search for the optimal combination of genes according to their scores.
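The abstract does not spell out how subgroup standardization is computed. A minimal sketch, assuming it means z-scoring each gene separately within each buffer subgroup after normalization, might look like this in Python. The function name `subgroup_standardize` and the toy values are hypothetical and are not taken from the thesis.

```python
from statistics import mean, pstdev


def subgroup_standardize(values, groups):
    """Z-score one gene's expression values separately within each subgroup.

    Assumption: a "subgroup" is the set of arrays hybridized with the same
    buffer, and standardization means subtracting the subgroup mean and
    dividing by the subgroup standard deviation.
    """
    # Collect the values that belong to each subgroup.
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    # Mean and population standard deviation per subgroup.
    stats = {g: (mean(vs), pstdev(vs)) for g, vs in by_group.items()}

    out = []
    for v, g in zip(values, groups):
        m, s = stats[g]
        # A constant subgroup has zero spread; return 0.0 rather than divide by zero.
        out.append((v - m) / s if s > 0 else 0.0)
    return out


# Hypothetical toy data: one gene, three samples, each measured under two buffers.
expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
buffers = ["A", "A", "A", "B", "B", "B"]
z = subgroup_standardize(expr, buffers)
```

In this toy case the constant offset between buffers A and B disappears after standardization, so each sample and its replicate end up with matching values, which is the kind of effect the matching rate is meant to capture.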
We then compare the outcomes using several indicators, namely matching rate, sensitivity, specificity, and ER hierarchical clustering results, on both training and testing data. We find that subgroup standardization helps classify ER positive and ER negative patients and also improves the matching rate when combined with hierarchical clustering. It is effective for examining the group-level behavior of the whole data set. Because the sensitivity is poor, however, it should not be used to examine the behavior and details of individual samples.
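For reference, sensitivity and specificity can be computed from true and predicted labels as sketched below. This assumes ER positive is treated as the "positive" class, which the abstract does not state; the labels and the function name are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive="ER+"):
    """Return (sensitivity, specificity) for a two-class labeling.

    Sensitivity = TP / (TP + FN): fraction of true positives that are detected.
    Specificity = TN / (TN + FP): fraction of true negatives that are detected.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    # Return NaN when a class is absent, since the ratio is then undefined.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical labels for five patients (not data from the thesis).
truth = ["ER+", "ER+", "ER+", "ER-", "ER-"]
pred = ["ER+", "ER-", "ER+", "ER-", "ER-"]
sens, spec = sensitivity_specificity(truth, pred)
```

Here one of three ER positive patients is missed, giving a sensitivity of 2/3, while both ER negative patients are classified correctly, giving a specificity of 1.0.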
Database: Networked Digital Library of Theses & Dissertations