Efficient Computation and Visualization of Multiple Density-Based Clustering Hierarchies
Authors: | Jörg Sander, Mario A. Nascimento, Ricardo J. G. B. Campello, Antonio Cavalcante Araujo Neto |
---|---|
Year of publication: | 2021 |
Subject: | |
Source: | IEEE Transactions on Knowledge and Data Engineering. 33:3075-3089 |
ISSN: | 2326-3865 1041-4347 |
DOI: | 10.1109/tkde.2019.2962412 |
Description: | HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter $m_{pts}$. While a small change in $m_{pts}$ typically leads to a small change in the clustering structure, choosing a “good” $m_{pts}$ value can be challenging: depending on the data distribution, a high or low $m_{pts}$ value may be more appropriate, and certain clusters may reveal themselves at different values. To explore results for a range of $m_{pts}$ values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of $m_{pts}$ values by building upon results from computational geometry to replace HDBSCAN*’s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for a computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of $m_{pts}$ values, along with case studies that illustrate how these analyses are performed. |
Database: | OpenAIRE |
External link: |
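
To make the cost described in the abstract concrete, the following is a minimal sketch of the naive baseline, in which a spanning tree over the complete graph is rebuilt from scratch for every $m_{pts}$ value. It is not the method proposed in the paper. It relies on the standard HDBSCAN* background definitions, which the abstract itself does not state: the core distance of a point is the distance to its $m_{pts}$-th nearest neighbor, and the mutual reachability distance between two points is the maximum of their two core distances and their ordinary distance. Counting the point itself as its own first neighbor is an assumption here, and the function names `core_distances`, `mutual_reachability_mst`, and `hierarchies_for_range` are hypothetical.

```python
import math


def core_distances(points, m_pts):
    """Distance from each point to its m_pts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor,
    so m_pts = 1 gives a core distance of 0 for every point.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[m_pts - 1])
    return cores


def mutual_reachability_mst(points, m_pts):
    """Minimum spanning tree of the complete mutual reachability graph.

    Uses Prim's algorithm over all O(n^2) pairs, i.e. the complete graph
    that the abstract says the proposed approach replaces.
    Returns a sorted list of edges (weight, i, j).
    """
    n = len(points)
    core = core_distances(points, m_pts)

    def mreach(i, j):
        return max(core[i], core[j], math.dist(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Relax the edges from u to every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return sorted(edges)


def hierarchies_for_range(points, m_pts_values):
    """Naive baseline: one independent spanning tree per m_pts value."""
    return {m: mutual_reachability_mst(points, m) for m in m_pts_values}


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
    for m, tree in hierarchies_for_range(pts, range(1, 4)).items():
        print(m, [round(w, 3) for w, _, _ in tree])
```

Sorting the edges of each tree by weight gives the order in which clusters merge for that $m_{pts}$ value. In this sketch every additional $m_{pts}$ value repeats the full quadratic computation, which is the redundancy the abstract describes as computationally impractical.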