Popis: |
Graph processing is one of the most important topics in big data processing. The graph architecture is well suited to distributed processing because the computation proceeds iteratively, which allows parallelism. The structure has also proved suitable for representing social networks, web page indexes, and many other problems. However, graph processing introduces many problems as well. Partitioning the graph to distribute the data across multiple machines while minimizing data movement is a serious challenge, and many graph algorithms have high complexity. GraphX is one of the frameworks that introduce a graph abstraction on top of Spark, an iterative data processing engine. However, GraphX and other novel graph abstractions still do not support processing data streams with online graphs. In this work we use IndexedRDD, a library that enables fine-grained updates as a key-value store on top of Spark, to represent a graph structure, and we test whether it can serve as an efficient online graph store for Spark Streaming. We ran experiments to compare our data streaming implementation using IndexedRDD with the obvious elementary solution of using RDD transformations to join the old RDD with the new one, producing a new composite RDD on each micro-batch. We also compare these two approaches with a distributed in-memory key-value store such as Redis. The results show a large advantage for Redis over both RDD transformations and IndexedRDD. However, Redis has some limitations, such as the lack of support for property graphs. IndexedRDD, on the other hand, performs well for insertions but must rebuild its index after each data update, which adds time to every lookup that cannot be tolerated when lookup speed is essential. |