Local-Density Subspace Distributed Clustering for High-Dimensional Data
Autor: | Juan Tan, Qingyong Li, Chong-Yung Chi, Mingfei Liang, Yangli-ao Geng, Heng Huang |
---|---|
Rok vydání: | 2020 |
Předmět: |
Clustering high-dimensional data
020203 distributed computing Computer science 02 engineering and technology Data modeling Kernel (linear algebra) Computational Theory and Mathematics Dimension (vector space) Hardware and Architecture Distributed algorithm Signal Processing 0202 electrical engineering electronic engineering information engineering Cluster analysis Algorithm Subspace topology Curse of dimensionality |
Zdroj: | IEEE Transactions on Parallel and Distributed Systems. 31:1799-1814 |
ISSN: | 2161-9883 1045-9219 |
DOI: | 10.1109/tpds.2020.2975550 |
Popis: | Distributed clustering is emerging along with the advent of the era of big data. However, most existing established distributed clustering methods focus on problems caused by a large amount of data rather than caused by the large dimension of data. Consequently, they suffer the “curse” of dimensionality (e.g., poor performance and heavy network overhead) when high-dimensional (HD) data are clustered. In this article, we propose a distributed algorithm, referred to as Local Density Subspace Distributed Clustering (LDSDC) algorithm, to cluster large-scale HD data, motivated by the idea that a local dense region of a HD dataset is usually distributed in a low-dimensional (LD) subspace. LDSDC follows a local-global-local processing structure, including grouping of local dense regions (atom clusters) followed by subspace Gaussian model (SGM) fitting (flexible and scalable to data dimension) at each sub-site, merging of atom clusters at every sub-site according to the merging result broadcast from the global site. Moreover, we propose a fast method to estimate the parameters of SGM for HD data, together with its convergence proof. We evaluate LDSDC on both synthetic and real datasets and compare it with four state-of-the-art methods. The experimental results demonstrate that the proposed LDSDC yields best overall performance. |
Databáze: | OpenAIRE |
Externí odkaz: |