An Efficient Approach for Improving Recursive Joins Based on Three-Way Joins in Spark
Author: | Thuong-Cang Phan, Thanh-Ngoan Trieu, Anh-Cang Phan |
---|---|
Year of publication: | 2020 |
Subject: | |
Source: | Advances in Computational Collective Intelligence ISBN: 9783030631185 ICCCI (CCIS Volume) |
DOI: | 10.1007/978-3-030-63119-2_46 |
Description: | In the era of Big Data, processing large datasets efficiently is a constant concern for researchers. The join is one such processing operation and appears in many data queries. It generates large amounts of intermediate data and network traffic, especially in its recursive form. Although extremely expensive, recursive joins are used in a wide variety of domains, including database, social network, and computer network analysis, compilers, data integration, and graph mining. This study therefore optimizes recursive joins in a Spark environment. The proposed solutions combine three-way join operations, Bloom filters, Spark RDDs, and caching techniques for iterative join computation. Together they significantly reduce the number of iterations and jobs executed, the amount of redundant data, and remote access to persistent data. Our experimental results show that the optimized recursive join is more efficient than a typical one: it halves the number of iterations, minimizes data transfer, and thus shortens execution time. |
Database: | OpenAIRE |
External link: |
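
The description's claim that three-way joins roughly halve the number of iterations can be illustrated with a minimal sketch. This is not the authors' Spark implementation: it is plain Python over in-memory sets, it omits the Bloom filters and RDD caching mentioned above, and the function names (`closure_two_way`, `closure_three_way`) are hypothetical. It assumes the recursive join is a transitive closure over an edge relation, evaluated semi-naively, so that each round only extends the pairs discovered in the previous round.

```python
from collections import defaultdict


def _index_by_source(edges):
    """Build a lookup from each source node to its direct successors."""
    index = defaultdict(set)
    for src, dst in edges:
        index[src].add(dst)
    return index


def _extend(pairs, succ):
    """Join pairs (a, b) with edges (b, c) to produce pairs (a, c)."""
    return {(a, c) for a, b in pairs for c in succ.get(b, ())}


def closure_two_way(edges):
    """Semi-naive closure: each round joins the delta with the edges once."""
    succ = _index_by_source(edges)
    closure = set(edges)
    delta = set(edges)
    rounds = 0
    while delta:
        rounds += 1
        # One extra hop per round; keep only pairs not seen before.
        delta = _extend(delta, succ) - closure
        closure |= delta
    return closure, rounds


def closure_three_way(edges):
    """Semi-naive closure: each round joins delta, edges, and edges again."""
    succ = _index_by_source(edges)
    closure = set(edges)
    delta = set(edges)
    rounds = 0
    while delta:
        rounds += 1
        one_hop = _extend(delta, succ)    # delta JOIN edges
        two_hop = _extend(one_hop, succ)  # delta JOIN edges JOIN edges
        # Two extra hops per round; keep only pairs not seen before.
        delta = (one_hop | two_hop) - closure
        closure |= delta
    return closure, rounds


if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(1, 9)]  # 1 -> 2 -> ... -> 9
    _, two_way_rounds = closure_two_way(chain)
    _, three_way_rounds = closure_three_way(chain)
    print("two-way rounds:", two_way_rounds)
    print("three-way rounds:", three_way_rounds)
```

Both functions return the same closure; the three-way variant reaches it in fewer rounds because every round advances paths by two hops instead of one. In a distributed setting each round would correspond to one or more jobs with shuffles, which is where the saving described above would come from.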