An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray Datasets

Autor: Lee, Li-Wen
Jazyk: angličtina
Rok vydání: 2006
Předmět:
Druh dokumentu: Text
Popis: A DNA microarray is a very good tool to study the gene expression level in different situations. Mining association rules in DNA microarray datasets can help us know how genes affect each other, and what genes are usually co-expressed. Mining closed large itemsets can be useful for reducing the size of the mining result from the DNA microarray datasets, where a closed itemset is an itemset that there is no superset whose support value is the same as the support value of this itemset. Since the number of genes stored in columns is much larger than the number of samples stored in rows in a DNA microarray dataset, traditional mining methods which use column enumeration face a great challenge, where the column enumeration means that enumerating itemsets from the combinations of items stored in columns. Therefore, several row enumeration methods, e.g., RERII, have been proposed to solve this problem, where row enumeration means that enumerating itemsets from the combinations of items stored in rows. Although the RERII method saves more memory space and has better performance than the other row enumeration methods, it needs complex intersection operations at each node of the row enumeration tree to generate the closed itemsets. In this thesis, we propose a new method, UMiner, which is based on the union operations to mine the closed large itemsets in the DNA microarray datasets. Our approach is a row enumeration method. First, we add all tuples in the transposed table to a prefix tree, where a transposed table records the information about where an item appears in the original table. Next, we traverse this prefix tree to create a row-node table which records the information about a node and the related range of its child nodes in the prefix tree created from the transposed table. Then we generate the closed itemset by using the union operations on the itemsets in the item groups stored in the row-node table. Since we do not use the intersection operations to generate the closed itemset for each enumeration node, we can reduce the time complexity that is needed at each node of the row enumeration tree. Moreover, we develop four pruning techniques to reduce the number of candidate closed itemsets in the row enumeration tree. By replacing the complex intersection operations with the union operations and designing four pruning techniques to reduce the number of branches in the row enumeration tree, our method can find closed large itemsets very efficiently. In our performance study, we use three real datasets which are the clinical data on breast cancer, lung cancer, and AML-ALL. From the experiment results, we show that our UMiner method is always faster than the RERII method in all support values, no matter what the average length of the closed large itemsets is. Moreover, in our simulation result, we also show that the processing time of our method increases much more slowly than that of the RERII method as the average number of items in the rows of a dataset increases.
Databáze: Networked Digital Library of Theses & Dissertations