Popis: |
Gene expression datasets obtained from DNA microarrays are examples of high-dimensional, structured datasets. In this thesis, we approach the task of applying pattern recognition techniques to gene expression data from an exploratory perspective. First, we develop methods and algorithms for the classification and prediction of cancer classes from high-dimensional gene expression data. We also develop algorithms for extracting the information content hidden within the compositions of the discovered classifiers. Second, we look at the problem of clustering structured gene expression data, such as temporal expression profiles, by introducing a method to cluster genes that have partially similar profiles. We demonstrate the classification methods on a gene expression dataset containing two acute leukemia classes. A prioritized feature-selection approach is followed to account for incomplete knowledge of gene function and complex inter-gene dependencies. We use a combination of class scatter metrics and heuristic search algorithms to determine all minimal combinations of genes that have the potential to discriminate between the two leukemia classes. A modified perceptron training algorithm then further trains these discriminant gene-sets. This process results in a large number of distinct and accurate classifiers. We then present an algorithm that mines these classifiers to discover ‘core’ patterns in their compositions. These gene-cores can be useful to biologists searching for inter-gene dependencies and gene function. Most current clustering algorithms cluster genes by taking into account the entire feature set of conditions. However, it is of interest to discover groups of genes that are co-expressed only under certain conditions, especially when the data is structured.
To address this need, we develop an ‘automatic partial-featureset clustering algorithm’ (APCA), together with a set of heuristics, that clusters genes according to partially similar expression profiles. The subset of features relevant to a particular clustering is chosen automatically as part of the clustering process. We apply our algorithm to a synthetic dataset and contrast the results with those obtained by applying a standard K-means clustering algorithm to the same data. |