Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA)

Authors: Rusli Bin Abdullah, Maslina Binti Zolkepli, Sayer Alshammari
Publication year: 2019
Subject:
Source: Advances in Intelligent Systems and Computing ISBN: 9783030360559
SCDM
DOI: 10.1007/978-3-030-36056-6_10
Description: Data clustering algorithms have received considerable attention in many application areas such as data mining, document retrieval, image processing and pattern classification. This paper proposes a hybrid data clustering algorithm that combines a genetic algorithm (GA) with a popular variant of the K-Means clustering algorithm, the parallel K-Means clustering algorithm (PKCA). The objective of the proposed algorithm is to use the search process of the GA to generate new data clusters and to apply parallel K-Means to further improve the speed and quality of the search process during cluster formation. The proposed approach is implemented using the popular MapReduce programming model on the Hadoop framework. Experiments were conducted with multiple synthetic datasets to evaluate the performance of the proposed algorithm. Results show that the proposed algorithm sped up the document clustering process by 0.54 s on average and outperformed PKCA. Data analysts in marketing and finance, telecommunication and transport companies, and researchers in academia can use this algorithm to make sense of their large volumes of data.
Database: OpenAIRE
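
Illustrative note: the description mentions K-Means expressed in the MapReduce model. The following is a minimal sketch of how one K-Means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). It is plain in-memory Python, not Hadoop code, and it is not the authors' implementation; the GA component is not shown. The function names `map_assign`, `reduce_mean`, and `kmeans_iteration` and the sample data are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: average all points emitted for one centroid index."""
    total, count = None, 0
    for point, n in pairs:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(v / count for v in total)


def kmeans_iteration(points, centroids):
    """One K-Means pass written as map -> shuffle (group by key) -> reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that receives no points is left unchanged.
    return [reduce_mean(groups[i]) if i in groups else tuple(centroids[i])
            for i in range(len(centroids))]


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
```

In a real Hadoop job the grouping by key would be done by the framework's shuffle phase, and the map and reduce functions would run on separate nodes over partitions of the data.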