Efficient Computation and Visualization of Multiple Density-Based Clustering Hierarchies

Authors: Jörg Sander, Mario A. Nascimento, Ricardo J. G. B. Campello, Antonio Cavalcante Araujo Neto
Year of publication: 2021
Subject:
Source: IEEE Transactions on Knowledge and Data Engineering. 33:3075-3089
ISSN: 2326-3865, 1041-4347
DOI: 10.1109/tkde.2019.2962412
Description: HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter $m_{pts}$. While a small change in $m_{pts}$ typically leads to a small change in the clustering structure, choosing a “good” $m_{pts}$ value can be challenging: depending on the data distribution, a high or low $m_{pts}$ value may be more appropriate, and certain clusters may reveal themselves at different values. To explore results for a range of $m_{pts}$ values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of $m_{pts}$ values by building upon results from computational geometry to replace HDBSCAN*’s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for the computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of $m_{pts}$ values, along with case studies that illustrate how these analyses are performed.
Database: OpenAIRE