On the Hardness and Approximation of Euclidean DBSCAN
Autor: | Yufei Tao, Junhao Gan |
---|---|
Rok vydání: | 2017 |
Předmět: |
Discrete mathematics
DBSCAN Computer science Dimension (graph theory) OPTICS algorithm 0102 computer and information sciences 02 engineering and technology 01 natural sciences Euclidean distance 010201 computation theory & mathematics SUBCLU 020204 information systems 0202 electrical engineering electronic engineering information engineering Cluster analysis Time complexity Algorithm Information Systems Curse of dimensionality |
Zdroj: | ACM Transactions on Database Systems. 42:1-45 |
ISSN: | 1557-4644 0362-5915 |
DOI: | 10.1145/3083897 |
Popis: | DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD‚96 paper claimed an algorithm of O ( n log n ) ”average runtime complexity„ (where n is the number of data points) without a rigorous proof. In 2013, a genuine O ( n log n )-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥3 has remained open ever since. This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥3, the problem requires ω( n 4/3 ) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ- approximate DBSCAN , which returns the same clusters as DBSCAN, unless the clusters are ”unstable„ (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O ( n ) expected time regardless of the constant dimensionality d . The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O ( n ) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O ( n log log n ) time deterministically, improving the existing O ( n log n ) bound. |
Databáze: | OpenAIRE |
Externí odkaz: |