A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability

Authors: Abdorrahman Haeri, Fateme Moslehi, Seyed Alireza Mousavian Anaraki
Year of publication: 2021
Subject:
Source: Pattern Analysis and Applications. 24:1387-1402
ISSN: 1433-755X
1433-7541
DOI: 10.1007/s10044-021-00977-x
Description: The K-means algorithm is a popular clustering method that is sensitive to its initialization and to the selected number of clusters. Its performance is also considerably affected on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimension reduction allows the initial centers to be selected in a smaller space, which offers a solution to the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling in the hybrid execution of both algorithms, proposing two methods, K-P and P-K. The clusters obtained from the two proposed methods are highly interpretable, as verified by the step-by-step labeling results on a human resource dataset. Interpretability was evaluated via the distribution of features of interest (FoI), suggesting improved results for both datasets. In addition to the improved qualitative results, the study showed improvements in the sum of squared estimate of errors (SSE)/N (where N is the total number of data points) and in the silhouette score on 10 datasets, compared with eight initialization methods from previous studies. The P-K method gave better results and run times than K-P.
Database: OpenAIRE
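
The abstract's general idea of choosing initial K-means centers in a PCA-reduced space and then reporting SSE/N can be illustrated with a minimal sketch. This is not the paper's K-P or P-K procedure: the sub-dataset construction and step-by-step labeling are not reproduced here. The seeding rule (sort points by their first principal component score and split them into k equal groups) is an assumed, simple heuristic, and the function names `first_pc_scores`, `seed_centers_from_pc1`, and `kmeans` are hypothetical.

```python
import numpy as np


def first_pc_scores(X):
    """Scores of each row of X on the first principal component."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]


def seed_centers_from_pc1(X, k):
    """Assumed heuristic: split points into k groups along PC1 and use
    each group's mean in the original space as an initial center."""
    order = np.argsort(first_pc_scores(X))
    groups = np.array_split(order, k)
    return np.array([X[g].mean(axis=0) for g in groups])


def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations from the given centers.
    Returns labels, final centers, and SSE/N."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        updated = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centers):
            break
        centers = updated
    # Recompute assignments against the final centers.
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist2.argmin(axis=1)
    sse_per_point = dist2.min(axis=1).sum() / len(X)
    return labels, centers, sse_per_point


# Synthetic data (illustrative only): three well-separated blobs in 5-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 5))
               for c in (0.0, 10.0, 20.0)])

labels, centers, sse_n = kmeans(X, seed_centers_from_pc1(X, k=3))
print("SSE/N:", sse_n)
```

Seeding along the first principal component makes the initialization deterministic, which avoids the run-to-run variation of random starts. Whether it improves SSE/N on a given dataset is an empirical question that this sketch does not settle.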