STICA: Spanning-Tree-Based Identification of Clusters With Agglomeration for Numerical Data

Autor: Mozammel H. A. Khan
Jazyk: angličtina
Rok vydání: 2024
Předmět:
Zdroj: IEEE Access, Vol 12, Pp 150430-150450 (2024)
Druh dokumentu: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3480460
Popis: Clustering is an unsupervised method used to group data points based on their similarity. It has many applications in data mining and preprocessing. Algorithms used for arbitrarily shaped clustering have several real-world applications. However, most of these algorithms require input parameters, such as the number of clusters, neighborhood radius, density threshold, and number of grids. However, it is difficult to estimate these parameters. A pragmatic approach is to develop a simple but parameterless clustering algorithm. To achieve this goal, we propose a parameterless algorithm for arbitrarily shaped clustering called spanning-tree-based identification of clusters (STIC) for numerical data. We consider data points as nodes of spanning trees (STs). We determined the STs’ edges based on the Euclidean distances between the node pairs and an internally estimated edge weight threshold. We generated clusters using breadth-first search (BFS) based on the notion of a spanning-tree-based representation of a cluster. Data points that were not visited by the BFS process were outliers. This algorithm is effective for datasets with inter-cluster separation. The STIC algorithm divides heterogeneous clusters into sub-clusters. In this case, we performed agglomerative merging of sub-clusters and outliers based on the single-linkage distances (SLDs) of the clusters and an internally estimated SLD threshold. This extended algorithm is called the STIC with agglomeration (STICA). Our experimental results with artificial datasets demonstrate that the STICA algorithm outperforms several widely used distance-based algorithms for arbitrarily shaped clusters. However, for real-world datasets, the STICA algorithm behaves similarly to other distance-based algorithms.
Databáze: Directory of Open Access Journals