A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability
Author: | Abdorrahman Haeri, Fateme Moslehi, Seyed Alireza Mousavian Anaraki |
---|---|
Year of publication: | 2021 |
Subject: |
Computer science
Dimensionality reduction; k-means clustering; Initialization; 020207 software engineering; Pattern recognition; 02 engineering and technology; Silhouette; Reduction (complexity); Artificial Intelligence; Principal component analysis; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; Cluster analysis; Interpretability |
Source: | Pattern Analysis and Applications. 24:1387-1402 |
ISSN: | 1433-755X 1433-7541 |
DOI: | 10.1007/s10044-021-00977-x |
Description: | The K-means algorithm is a popular clustering method, but it is sensitive to the initialization of cluster centers and to the choice of the number of clusters. Its performance is also considerably affected on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimension reduction allows the initial centers to be selected in a smaller space, which addresses the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling in the hybrid execution of both algorithms to propose two methods, namely K-P and P-K. The clusters obtained from the two proposed methods are highly interpretable, as verified by the step-by-step labeling results on a human resource dataset. Interpretability was evaluated via the distribution of features of interest (FoI), suggesting improved results for both datasets. In addition to the improvement in qualitative results, the study reports improved sum of squared errors divided by the total number of data points (SSE/N) and improved silhouette values on 10 datasets compared with eight initialization methods from previous studies. The P-K method gave better results and a shorter run time than the K-P method. |
Database: | OpenAIRE |
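The description states that dimension reduction lets initial centers be chosen in a smaller space, and that results are scored by SSE/N. The sketch below illustrates that general idea only; it is not the K-P or P-K method from the paper, whose sub-dataset and step-by-step labeling details are not given in this record. It assumes a single principal component found by power iteration, an equal-size split of the sorted projections as the seeding rule, and synthetic data. All function names are hypothetical.

```python
import random

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def first_component_scores(points, iters=200):
    """Project centered data onto the first principal direction,
    approximated by power iteration (assumed seeding rule, not the paper's)."""
    mu = mean(points)
    centered = [[x - m for x, m in zip(p, mu)] for p in points]
    d = len(mu)
    v = [1.0 / d ** 0.5] * d  # fixed start vector; fails if orthogonal to the component
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
        # w = X^T (X v), proportional to the covariance matrix applied to v
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(a * b for a, b in zip(row, v)) for row in centered]

def pca_init_centers(points, k):
    """Sort points by their 1-D score, cut into k equal-size chunks,
    and average each chunk in the original space (requires len(points) >= k)."""
    scores = first_component_scores(points)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    n = len(points)
    chunks = [order[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [mean([points[i] for i in chunk]) for chunk in chunks]

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]  # do not mutate the caller's list
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels, centers

def sse_per_point(points, labels, centers):
    """Sum of squared errors divided by the number of points (SSE/N)."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels)) / len(points)

# Synthetic demo data: three well-separated 2-D blobs of 30 points each.
random.seed(0)
true_means = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [[random.gauss(mx, 0.5), random.gauss(my, 0.5)]
          for mx, my in true_means for _ in range(30)]

init = pca_init_centers(points, k=3)
labels, centers = kmeans(points, init)
print("SSE/N:", sse_per_point(points, labels, centers))
```

Because the seeds come from a deterministic ordering rather than random draws, repeated runs on the same data start from the same centers, which is the property that makes initialization in a reduced space attractive.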