An LSH-based k-representatives clustering method for large categorical data

Authors: Toan Nguyen Mau, Van-Nam Huynh
Publication year: 2021
Subject:
Source: Neurocomputing. 463:29-44
ISSN: 0925-2312
DOI: 10.1016/j.neucom.2021.08.050
Description: Clustering categorical data remains a challenging problem in the era of big data, due to the difficulty of measuring similarity or dissimilarity meaningfully for categorical data and the high computational complexity of existing clustering algorithms, which makes them difficult to apply in practical big data mining applications. In this paper, we propose an integrated approach that incorporates the Locality-Sensitive Hashing (LSH) technique into k-means-like clustering so that it can predict better initial clusters and thereby boost clustering effectiveness. To this end, we first utilize a data-driven dissimilarity measure for categorical data to construct a family of binary hash functions, which are then used to generate the initial clusters. We also propose to use a nearest neighbor search at each iteration for cluster reassignment of data objects to reduce the computational cost of clustering. These solutions are incorporated into the k-representatives algorithm, resulting in the so-called LSH-k-representatives algorithm. Extensive experiments conducted on multiple real-world and synthetic datasets have demonstrated the effectiveness of the proposed method. It is shown that the newly developed algorithm yields comparable or better clustering results than existing closely related works, yet it is significantly more efficient, by a factor of between 2× and 32×.
Database: OpenAIRE
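
The approach summarized in the description (binary hash functions producing initial clusters, followed by k-representatives iterations) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the hash construction here (a random split of one attribute's category values per bit) stands in for the paper's data-driven dissimilarity measure, the reassignment step scans all clusters instead of using a nearest neighbor search, and every function name and parameter below is hypothetical.

```python
import random
from collections import Counter, defaultdict


def make_hash_functions(data, n_bits, rng):
    """Assumed hash family: each bit picks one attribute and maps a
    random half of that attribute's category values to 1, the rest to 0."""
    n_attrs = len(data[0])
    funcs = []
    for _ in range(n_bits):
        attr = rng.randrange(n_attrs)
        values = sorted({row[attr] for row in data})
        ones = set(rng.sample(values, max(1, len(values) // 2)))
        funcs.append((attr, ones))
    return funcs


def bucket_key(row, funcs):
    """Concatenate the binary hash outputs into one bucket key."""
    return tuple(int(row[attr] in ones) for attr, ones in funcs)


def representative(rows):
    """Per-attribute relative frequencies of category values in a cluster."""
    n = len(rows)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*rows)]


def dissimilarity(row, rep):
    """Simple-matching distance from an object to a frequency-based
    representative (an assumption, not the paper's measure)."""
    return sum(1.0 - freqs.get(v, 0.0) for v, freqs in zip(row, rep))


def lsh_k_representatives(data, k, n_bits=4, max_iter=20, seed=0):
    rng = random.Random(seed)
    funcs = make_hash_functions(data, n_bits, rng)

    # Group objects by hash key; the k largest buckets seed the clusters.
    buckets = defaultdict(list)
    for row in data:
        buckets[bucket_key(row, funcs)].append(row)
    seeds = sorted(buckets.values(), key=len, reverse=True)[:k]
    if len(seeds) < k:
        raise ValueError("fewer than k non-empty buckets; increase n_bits")
    reps = [representative(rows) for rows in seeds]

    labels = None
    for _ in range(max_iter):
        # Reassign every object to its closest representative (full scan).
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(row, reps[j]))
            for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute representatives from the current members.
        for j in range(k):
            members = [r for r, lab in zip(data, labels) if lab == j]
            if members:
                reps[j] = representative(members)
    return labels
```

Calling `lsh_k_representatives(rows, k=2)` on a list of equal-length tuples of category values returns one cluster label per row. Seeding from the largest buckets is one simple way to turn hash collisions into initial clusters; the paper's actual selection rule may differ.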