An effective clustering scheme for high-dimensional data.

Authors: He, Xuansen; He, Fan; Fan, Yueping; Jiang, Lingmin; Liu, Runzong; Maalla, Allam
Subject:
Source: Multimedia Tools & Applications; May 2024, Vol. 83, Issue 15, pp. 45001-45045 (45 pages)
Abstract: Although the classical K-means algorithm is widely used in many fields, it still has some defects. This paper therefore proposes a scheme to improve the clustering quality of the K-means algorithm. Farthest initial center selection and the min–max rule replace the random initialization of K-means, which avoids empty clusters in the clustering results. For high-dimensional data sets, standardized feature scaling rescales each feature to zero mean and unit variance, and supervised linear discriminant analysis (LDA) is used to reduce the data dimension and facilitate visualization. The empirical rule is used to estimate the range of the number of clusters. Within this range, the number of clusters is estimated visually by locating the elbow of the sum-of-squared-errors (SSE) curve. Further, a novel clustering validity function f(K) is proposed to determine the optimal number of clusters for complex real-world data sets. Through silhouette analysis, the clustering quality can be evaluated intuitively by calculating the silhouette coefficient of each cluster and observing its size. Simulation results on different types of data sets show that the scheme not only improves the clustering quality of the K-means algorithm but also provides a visual cluster analysis method for high-dimensional data sets. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
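
The abstract names farthest initial center selection with a min–max rule, and an SSE curve searched within an empirically bounded range of K. The record does not give the exact procedure, so the following is only a minimal sketch under stated assumptions: the first center is taken as the point farthest from the data mean, each later center is the point whose minimum distance to the already chosen centers is largest, and the upper bound on K is taken as floor(sqrt(n)). All function names are hypothetical, and the validity function f(K) is not reproduced because the record does not define it.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Deterministic seeding (an assumed reading of 'farthest selection + min-max rule')."""
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    # Assumption: start from the point farthest from the overall mean.
    centers = [max(points, key=lambda p: sq_dist(p, mean))]
    while len(centers) < k:
        # Min-max step: pick the point with the largest minimum distance to the centers.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]

def kmeans(points, k, max_iter=100):
    """Lloyd iterations started from the deterministic seeds; returns centers, labels, SSE."""
    centers = farthest_first_centers(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse

# Toy data (illustrative only): three well-separated 2-D groups of four points each.
points = [(x + dx, y + dy) for (x, y) in [(0, 0), (10, 10), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]

k_max = math.isqrt(len(points))  # assumed empirical upper bound on K
sse_curve = {k: kmeans(points, k)[2] for k in range(1, k_max + 1)}
centers, labels, sse = kmeans(points, 3)
```

Plotting `sse_curve` against K and looking for the bend gives the elbow estimate the abstract describes. Because the seeding is deterministic, repeated runs on the same data return the same partition, unlike random initialization.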