Description: |
The JVM (Java virtual machine) is the cornerstone of most big data frameworks, providing automatic memory management and enabling high-productivity languages. Aside from the performance overhead induced by JVM languages (e.g., Java and Scala), big data frameworks, including Spark, also restrict code execution to general-purpose processors (CPUs), whereas HPC clusters readily include dedicated accelerators to achieve their high performance. In this paper, we analyze state-of-the-art developments in the field of heterogeneously accelerated Spark and propose SparkJNI, a framework for JNI-accelerated Spark. The design provides two main components. First, it enables seamless use of native CPU code, as well as integration of GPU and FPGA accelerators. Second, SparkJNI enables accelerated execution through native code integration by automatically generating C++ code wrappers, which simplifies development for the programmer. This makes it non-disruptive to the Java programmer while allowing great flexibility for native code development. Benchmark results show insignificant JNI-induced overhead in access time and bandwidth, with speedups of up to 12x for compute-intensive kernels (such as convolution) compared to pure Java Spark implementations. Finally, a DNA analysis algorithm (Pair-HMM) is implemented in Spark and integrated with FPGAs, targeting cluster deployments; benchmark results show an overall speedup of approximately 2.7x over state-of-the-art CPU optimizations. The results of the presented work, along with the SparkJNI framework, are publicly available on GitHub for open-source use and development. |