RT-DBSCAN: Real-Time Parallel Clustering of Spatio-Temporal Data Using Spark-Streaming

Autor: Yikai Gong, Paul Rimba, Richard O. Sinnott
Rok vydání: 2018
Předmět:
Zdroj: Lecture Notes in Computer Science ISBN: 9783319936970
ICCS (1)
DOI: 10.1007/978-3-319-93698-7_40
Popis: Clustering algorithms are essential for many big data applications involving point-based data, e.g. user generated social media data from platforms such as Twitter. One of the most common approaches for clustering is DBSCAN. However, DBSCAN has numerous limitations. The algorithm itself is based on traversing the whole dataset and identifying the neighbours around each point. This approach is not suitable when data is created and streamed in real-time however. Instead a more dynamic approach is required. This paper presents a new approach, RT-DBSCAN, that supports real-time clustering of data based on continuous cluster checkpointing. This approach overcomes many of the issues of existing clustering algorithms such as DBSCAN. The platform is realised using Apache Spark running over large-scale Cloud resources and container based technologies to support scaling. We benchmark the work using streamed social media content (Twitter) and show the advantages in performance and flexibility of RT-DBSCAN over other clustering approaches.
Databáze: OpenAIRE