Mining non-derivable hypercliques

Autor: Anna Koufakou
Rok vydání: 2013
Předmět:
Zdroj: Knowledge and Information Systems. 41:77-99
ISSN: 0219-3116
0219-1377
DOI: 10.1007/s10115-013-0660-8
Popis: A hyperclique (Xiong et al. in Proceedings of the IEEE international conference on data mining, pp 387---394, 2003) is an itemset containing items that are strongly correlated with each other, based on a user-specified threshold. Hypercliques (HCs) have been successfully used in a number of applications, for example, clustering (Xiong et al. in Proceedings of the 4th SIAM international conference on data mining, pp 279---290, 2004) and noise removal (Xiong et al. in IEEE Trans Knowl Data Eng 18(3):304---319, 2006). Even though HC has been shown to respond well to datasets with skewed support distribution and low support threshold, it may still grow very large for dense datasets and lower h-confidence threshold. In this paper, we propose a new pruning method based on combining HCs and non-derivable itemsets (NDIs) (Calders and Goethals in Proceedings of the PKDD international conference on principles of data mining and knowledge discovery, pp 74---85, 2002) in order to substantially reduce the amount of generated HCs. Specifically, we propose a new collection of HCs, called non-derivable hypercliques (NDHCs). The NDHC collection is a lossless representation of HCs, that is, given the itemsets in NDHCs, we can generate the complete HC collection and their support, without additional scanning of the dataset. We present an efficient algorithm to mine all NDHC sets, NDHCMiner, and an algorithm to derive all HC sets and their support from NDHCs, NDHCDeriveAll. We experimentally compare our collection, NDHC with HC, with respect to runtime performance as well as total number of generated sets, using real and artificial data. We also show comparisons with another condensed representation of HCs, maximal hyperclique patterns (MHPs). Our experiments show that the NDHC collection offers substantial advantages over HCs, and even MHPs, especially for dense datasets and lower h-confidence values.
Databáze: OpenAIRE