Popis: |
K-means is one of the most widely used algorithms in data clustering. A number of metrics are coupled with k-means to cluster data so as to improve both local cluster compactness and global cluster separation. Before the data are finally assigned to their corresponding clusters, selecting the optimal number of clusters is a crucial step in the clustering process. The present work aims to build a new clustering metric/heuristic that takes into account both the spatial dispersion and the inferential characteristics of the data to be clustered. Hence, in this paper, a Geometry-Inference based Clustering (GIC) heuristic is proposed for selecting the optimal number of clusters. The conceptual approach proposes the “Initial speed rate” as the main geometric parameter to be studied inferentially. The corresponding histograms are then fitted with classical distributions. A clear linear behaviour of the distributions’ parameters with respect to the optimal number of clusters k* was detected for each of the 14 datasets adopted in this work. For each dataset, the optimal k* is observed to coincide with the change-point defined as the intersection of two clearly salient lines. All fits are tested using chi-squared (χ²) tests, which show excellent agreement in terms of p-values; the linear fits are additionally assessed with R². A change-point algorithm is then run to select k*. To sum up, the GIC heuristic is fully quantitative and fully automated; no qualitative index or graphical technique is used. |