An empirical comparison of Big Graph frameworks in the context of network analysis
Authors: Jannis Koch, Henning Meyerhenke, Christian L. Staudt, Maximilian Vogel
Year of publication: 2016
Subjects: Social and Information Networks (cs.SI); Distributed, Parallel, and Cluster Computing (cs.DC); connected component; complex network; relational database; computer cluster; distributed data store; programming paradigm; cluster analysis; network analysis; theoretical computer science
Source: Social Network Analysis and Mining, vol. 6
ISSN: 1869-5450, 1869-5469
DOI: 10.1007/s13278-016-0394-1
Abstract: Complex networks are heterogeneous relational data sets with nontrivial substructures and statistical properties. They are typically represented as graphs consisting of vertices and edges. The analysis of their intricate structure is relevant to many areas of science and commerce, and data sets may reach sizes that require distributed storage and processing. We describe and compare programming models for distributed computing with a focus on graph algorithms for large-scale complex network analysis. Four frameworks (GraphLab, Apache Giraph, Giraph++ and Apache Flink) are used to implement algorithms for the representative problems connected components, community detection, PageRank and clustering coefficients. The implementations are executed on a computer cluster to evaluate the frameworks' suitability in practice and to compare their performance to that of the single-machine, shared-memory parallel network analysis package NetworKit. Among the distributed frameworks, GraphLab and Apache Giraph generally show the best performance. In our experiments, a cluster of eight computers running Apache Giraph enables the analysis of a network with ca. 2 billion edges, which is too large for a single machine of the same type. However, for networks that fit into the memory of one machine, the performance of the shared-memory parallel implementation is usually far better than that of the distributed implementations. The study provides experimental evidence for selecting the appropriate framework depending on the task and data volume.
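The vertex-centric ("think like a vertex") programming model underlying Apache Giraph can be illustrated on PageRank, one of the problems studied in the paper. The sketch below is a minimal single-machine simulation of that model, not code from the paper or from any of the frameworks; the example graph, damping factor and superstep count are illustrative assumptions.

```python
def pagerank_vertex_centric(adjacency, damping=0.85, supersteps=30):
    """Pregel-style PageRank: in each superstep every vertex combines
    the messages received from its in-neighbours, updates its rank,
    and sends rank/out_degree to each out-neighbour."""
    n = len(adjacency)
    ranks = {v: 1.0 / n for v in adjacency}
    # messages delivered at the start of the first superstep
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        share = ranks[v] / len(neighbours) if neighbours else 0.0
        for u in neighbours:
            inbox[u].append(share)
    for _ in range(supersteps):
        next_inbox = {v: [] for v in adjacency}
        for v, neighbours in adjacency.items():
            # vertex program: sum incoming messages, update local state
            ranks[v] = (1 - damping) / n + damping * sum(inbox[v])
            share = ranks[v] / len(neighbours) if neighbours else 0.0
            for u in neighbours:
                next_inbox[u].append(share)
        inbox = next_inbox
    return ranks

# Tiny directed example graph (adjacency lists); vertex 2 has two in-links
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank_vertex_centric(graph)
```

In the real frameworks, each vertex's compute function runs in parallel across the cluster and message passing happens over the network between supersteps; this sketch only mimics that control flow sequentially.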
Database: OpenAIRE