An empirical comparison of Big Graph frameworks in the context of network analysis
Authors: Jannis Koch, Henning Meyerhenke, Christian L. Staudt, Maximilian Vogel
Year of publication: 2016
Subjects: Social and Information Networks (cs.SI); Distributed, Parallel, and Cluster Computing (cs.DC); connected component; complex network; relational database; computer cluster; distributed data store; programming paradigm; cluster analysis; network analysis; theoretical computer science
Source: Social Network Analysis and Mining, vol. 6
ISSN: 1869-5450, 1869-5469
DOI: 10.1007/s13278-016-0394-1
Abstract: Complex networks are heterogeneous relational data sets with nontrivial substructures and statistical properties. They are typically represented as graphs consisting of vertices and edges. The analysis of their intricate structure is relevant to many areas of science and commerce, and data sets may reach sizes that require distributed storage and processing. We describe and compare programming models for distributed computing with a focus on graph algorithms for large-scale complex network analysis. Four frameworks (GraphLab, Apache Giraph, Giraph++ and Apache Flink) are used to implement algorithms for the representative problems connected components, community detection, PageRank and clustering coefficients. The implementations are executed on a computer cluster to evaluate the frameworks' suitability in practice and to compare their performance to that of the single-machine, shared-memory parallel network analysis package NetworKit. Among the distributed frameworks, GraphLab and Apache Giraph generally show the best performance. In our experiments, a cluster of eight computers running Apache Giraph enables the analysis of a network with ca. 2 billion edges, which is too large for a single machine of the same type. However, for networks that fit into the memory of one machine, the performance of the shared-memory parallel implementation is usually far better than that of the distributed implementations. The study provides experimental evidence for selecting the appropriate framework depending on the task and data volume.
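The vertex-centric ("think like a vertex") programming model underlying Apache Giraph can be illustrated on PageRank, one of the problems studied in the paper. The sketch below is a minimal single-machine simulation of that model, not code from the paper or from any of the frameworks; the example graph, damping factor and superstep count are illustrative assumptions.

```python
def pagerank_vertex_centric(adjacency, damping=0.85, supersteps=30):
    """Pregel-style PageRank: in each superstep every vertex combines
    the messages received from its in-neighbours, updates its rank,
    and sends rank/out_degree to each out-neighbour."""
    n = len(adjacency)
    ranks = {v: 1.0 / n for v in adjacency}
    # messages delivered at the start of the first superstep
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        share = ranks[v] / len(neighbours) if neighbours else 0.0
        for u in neighbours:
            inbox[u].append(share)
    for _ in range(supersteps):
        next_inbox = {v: [] for v in adjacency}
        for v, neighbours in adjacency.items():
            # vertex program: sum incoming messages, update local state
            ranks[v] = (1 - damping) / n + damping * sum(inbox[v])
            share = ranks[v] / len(neighbours) if neighbours else 0.0
            for u in neighbours:
                next_inbox[u].append(share)
        inbox = next_inbox
    return ranks

# Tiny directed example graph (adjacency lists); vertex 2 has two in-links
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank_vertex_centric(graph)
```

In the real frameworks, each vertex's compute function runs in parallel across the cluster and message passing happens over the network between supersteps; this sketch only mimics that control flow sequentially.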
Database: OpenAIRE