Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform

Author:	Weina He, Feifei Ning, Dongliang Xia
Year of publication:	2020
Subject:	020203 distributed computing; Computer Networks and Communications; business.industry; Computer science; Estimation theory; Process (computing); k-means clustering; 020206 networking & telecommunications; Cloud computing; 02 engineering and technology; Set (abstract data type); Dimension (vector space); Hardware and Architecture; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Cluster analysis; business; Algorithm; Software; Information Systems
Source:	Journal of Grid Computing. 18:263-273
ISSN:	1572-9184 1570-7873
DOI:	10.1007/s10723-019-09504-z
Description:	This paper first surveys the types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. It then presents a parallel Canopy-K-means algorithm, built on the MapReduce computing model and the Spark cloud computing framework, in which the Canopy algorithm is used to choose the initial values for K-means. However, Canopy introduces a new distance threshold parameter, T2, which must be set from human experience and is difficult to choose manually for large data. The paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines T2 adaptively using a statistical method. Exploiting the parallelism of the MapReduce model, the parallel Canopy-K-means algorithm is optimized with adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy step. After presenting the relevant theory and the derivation of the algorithm, the authors build a cloud computing experiment platform on the Spark framework and run comparative experiments on the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::0efcbc12962ec2ee6b7a068e3e02909c https://doi.org/10.1007/s10723-019-09504-z
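
Illustrative sketch (not part of the catalog record): the abstract describes using Canopy with a distance threshold T2 to choose the initial centers for K-means. The single-machine NumPy code below is a rough illustration of that idea only. It is not the authors' parallel Spark/MapReduce implementation, and `estimate_t2` is a hypothetical stand-in (mean pairwise distance in a random sample) for the statistical estimator derived in the paper, which this record does not specify. All function names and parameters are assumptions made for the example.

```python
import numpy as np


def estimate_t2(points, sample_size=200, seed=0):
    """Hypothetical T2 heuristic: mean pairwise distance in a random sample.

    Assumption for illustration only; the paper's estimator is not given here.
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = points[idx]
    diffs = sample[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(sample), k=1)  # each pair counted once
    return float(dists[upper].mean())


def canopy_centers(points, t2):
    """Take a remaining point as a center, drop all points within t2, repeat."""
    if t2 < 0:
        raise ValueError("t2 must be non-negative")
    remaining = points.copy()
    centers = []
    while len(remaining) > 0:
        center = remaining[0]
        centers.append(center)
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > t2]  # the center itself (dist 0) is removed
    return np.array(centers)


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)


def kmeans(points, init_centers, max_iter=100, tol=1e-6):
    """Plain Lloyd iterations starting from the supplied initial centers."""
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, assign(points, centers)


# Usage on synthetic data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=100.0, scale=1.0, size=(50, 2))
data = np.vstack([blob_a, blob_b])

t2 = estimate_t2(data)
seeds = canopy_centers(data, t2)
final_centers, labels = kmeans(data, seeds)
print(len(seeds), "canopy centers; T2 =", round(t2, 2))
```

In this sketch the number of canopy centers also fixes K, so the quality of the T2 estimate directly controls how many clusters K-means is asked to find.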