RT-DBSCAN: Real-Time Parallel Clustering of Spatio-Temporal Data Using Spark-Streaming
Author: | Yikai Gong, Paul Rimba, Richard O. Sinnott |
---|---|
Year of publication: | 2018 |
Subject: |
DBSCAN
Computer science; business.industry; Big data; 02 engineering and technology; computer.software_genre; Temporal database; 020204 information systems; Container (abstract data type); Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Benchmark (computing); 020201 artificial intelligence & image processing; Data mining; Cluster analysis; business; computer |
Source: | Lecture Notes in Computer Science; ISBN: 9783319936970; ICCS (1) |
DOI: | 10.1007/978-3-319-93698-7_40 |
Description: | Clustering algorithms are essential for many big data applications involving point-based data, e.g. user-generated social media data from platforms such as Twitter. One of the most common approaches for clustering is DBSCAN. However, DBSCAN has numerous limitations. The algorithm itself is based on traversing the whole dataset and identifying the neighbours around each point. This approach is unsuitable, however, when data is created and streamed in real time. Instead, a more dynamic approach is required. This paper presents a new approach, RT-DBSCAN, that supports real-time clustering of data based on continuous cluster checkpointing. This approach overcomes many of the issues of existing clustering algorithms such as DBSCAN. The platform is realised using Apache Spark running over large-scale Cloud resources and container-based technologies to support scaling. We benchmark the work using streamed social media content (Twitter) and show the advantages in performance and flexibility of RT-DBSCAN over other clustering approaches. |
Database: | OpenAIRE |
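
The description notes that DBSCAN traverses the whole dataset and identifies the neighbours around each point. A minimal Python sketch of that classic batch procedure may help show why it fits poorly with streamed data: every run rescans all points. This is not the RT-DBSCAN method of the paper, whose checkpointing details are not given in this record; the names `region_query`, `dbscan`, `eps`, and `min_pts` are illustrative assumptions, and the brute-force neighbour search stands in for whatever spatial index a real system would use.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i itself).

    Brute force: scans the whole dataset for every query.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Assumed parameters (hypothetical names): eps is the neighbourhood
    radius, min_pts the minimum neighbourhood size for a core point.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, is not expanded
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point
    return labels
```

In this sketch a newly arrived point cannot be added without calling `dbscan` again over the full list, which is the limitation the description attributes to batch DBSCAN on real-time streams.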