Translation of Array-Based Loops to Spark SQL
Authors: | Hasanuzzaman Noor, Leonidas Fegaras |
---|---|
Year of publication: | 2020 |
Subject: |
SQL; Computer science; business.industry; Relational database; Programming language; 05 social sciences; Big data; Process (computing); 050301 education; 02 engineering and technology; Translation (geometry); computer.software_genre; Imperative programming; 020204 information systems; Spark (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Code (cryptography); business; 0503 education; computer; computer.programming_language |
Source: | IEEE BigData |
DOI: | 10.1109/bigdata50022.2020.9378136 |
Description: | Many programs written to analyze data are expressed in terms of array operations in an imperative programming language with loops. However, for data analysts who need to analyze vast volumes of data, large-scale data-intensive processing is becoming a necessity. Hence, they want to convert their programs, originally written to run on a single computer, to work on current Big Data systems, such as Map-Reduce and Spark, so that they can process larger amounts of data. We present a novel framework, called SQLgen, that automatically translates imperative programs with loops and array operations to distributed data-parallel programs. Unlike related work, SQLgen translates these programs to SQL, which can be translated to more efficient code since it can be optimized using a relational database optimizer. SQLgen has been implemented on Spark SQL. We compare the performance of SQLgen with DIABLO, hand-written RDD-based, and Spark SQL programs on real-world problems. SQLgen is up to 78× faster than DIABLO and up to 25× faster than hand-written RDD-based programs, giving performance close to that of hand-written programs in Spark SQL. |
Database: | OpenAIRE |
External link: |
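
To make the idea in the description concrete, the following is a minimal hand-written sketch of how an array-based loop can be restated as a SQL query. It is not SQLgen's actual output or translation algorithm, which this record does not describe. The function names `matvec_loop` and `matvec_sql`, the table names `M` and `X`, and the choice of matrix-vector multiplication are all assumptions made for illustration, and SQLite from the Python standard library stands in for Spark SQL so that the example is self-contained.

```python
import sqlite3


def matvec_loop(M, x):
    """Imperative form: nested loops that accumulate into an array."""
    y = [0.0] * len(M)
    for i in range(len(M)):
        for j in range(len(x)):
            y[i] += M[i][j] * x[j]
    return y


def matvec_sql(M, x):
    """Relational form: arrays stored as (index, value) tables,
    the loop body becomes a join and the accumulation a GROUP BY."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE M (i INTEGER, j INTEGER, v REAL)")
    con.execute("CREATE TABLE X (j INTEGER, v REAL)")
    con.executemany(
        "INSERT INTO M VALUES (?, ?, ?)",
        [(i, j, v) for i, row in enumerate(M) for j, v in enumerate(row)],
    )
    con.executemany("INSERT INTO X VALUES (?, ?)", list(enumerate(x)))
    rows = con.execute(
        "SELECT M.i, SUM(M.v * X.v) "
        "FROM M JOIN X ON M.j = X.j "
        "GROUP BY M.i ORDER BY M.i"
    ).fetchall()
    con.close()
    return [total for _, total in rows]
```

Both functions compute the same vector. The second form is the kind of declarative query a relational optimizer can reorder and distribute, which is the motivation the description gives for targeting SQL.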