Popis: |
We describe the design and implementation of EdgeFrame: a graph-specialized Spark DataFrame that caches the edges of a graph in compressed form on all worker nodes of a cluster, and provides a fast and scalable Worst-Case-Optimal Join (WCOJ) that is especially useful for matching of complex and cyclical patterns in large graphs. Our choice to forego shuffle- or communication-based WCOJ is motivated by our analysis of the Shares algorithm for distributed WCOJ, that was proven communication-optimal, but which we show to quickly deteriorate to a full broadcast of all data already with moderately complex graph patterns. Our work shows that specializing WCOJ to a multi-way self-join, and leveraging compressed storage, provides a significant opportunity for better WCOJ performance. Finally, we investigate WCOJ parallelization and load-balancing strategies and show that fine-grained dynamic load-balancing with work-stealing is to be preferred, creating interesting insights and challenges for the future evolution of the Spark scheduler. |