Autor: |
Jørgensen, Jakob Rødsgaard, Nellemann, Katrine Scheel, Assent, Ira |
Jazyk: |
angličtina |
Rok vydání: |
2021 |
Zdroj: |
Jørgensen, J R, Nellemann, K S & Assent, I 2021, ' GPU-INSCY: A GPU-Parallel Algorithm and Tree Structure for Efficient Density-based Subspace Clustering ', Paper presented at 24th International Conference on Extending Database Technology, 23/03/2021-26/03/2021 . https://doi.org/10.5441/002/edbt.2021.04 |
DOI: |
10.5441/002/edbt.2021.04 |
Popis: |
Subspace clustering is the task of grouping objects based on mutual similarity in subspaces of the full-dimensional space.The INSCY algorithm extends the well-known density-based clustering algorithm DBSCAN. It finds dimensionality-unbiased non-redundant subspace clusters using a tree structure to speed up the processing of subspaces. Still, finding density-based clusters in all subspaces implies an exponential search space in the number of dimensions. Thus, the running time of INSCY is still measured in hours on even small datasets of 2000 points. For larger datasets, it becomes prohibitively expensive.To benefit from INSCY for real-world sized datasets, we propose a novel GPU-parallel approach that runs on standard graphics cards. To utilize the many cores of the GPU, we need new algorithmic strategies that fit the computational model of the GPU. While the GPU provides a large number of threads, traditional algorithms incur diverging threads and poor memory alignment, both of which lead to idle time and poor runtime performance. In INSCY, extracting subspace regions from the SCY-tree structure and the density-based clustering of regions itself are thus unfit for the GPU. Our novel GPU-friendly algorithm GPU-INSCY computes the same subspace clustering as INSCY at dramatically reduced runtimes. To achieve this, we devise a restructured SCY-tree index-structure and associated operations for the GPU, as well as a GPU-parallel density-based subspace clustering.We experimentally show that GPU-INSCY scales well with the size of the dataset and the number of dimensions, and improves the running time of INSCY by a factor of several thousand for large datasets of high dimensionality. |
Databáze: |
OpenAIRE |
Externí odkaz: |
|