A two-level formal model for Big Data processing programs

Authors: João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante
Contributors: Universidade Federal do Rio de Janeiro (UFRJ), Laboratoire Franco-Mexicain d'Informatique et d'Automatique (LAFMIA), Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centro de Investigacion y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV)-Université de Technologie de Compiègne (UTC)-Consejo Nacional de Ciencia y Tecnología [Mexico] (CONACYT)-Centre National de la Recherche Scientifique (CNRS), Universidade Federal do Rio Grande do Norte [Natal] (UFRN), CAPES, Brésil, Universidade Federal Rio Grande do Norte - CNRS, LIRIS
Year of publication: 2022
Subject:
Source: Science of Computer Programming
Science of Computer Programming, Elsevier, In press
ISSN: 0167-6423
DOI: 10.1016/j.scico.2021.102764
Description: This paper proposes a model for specifying data flow-based parallel data processing programs independently of the target Big Data processing framework. It focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs at two levels: a higher level representing the program's data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). The model is then extended to cover iterative data processing programs. A general specification of such programs, as implemented by data flow-based parallel programming models, is essential given the democratization of iterative and greedy Big Data analytics algorithms, which call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam, and Apache Flink, and discusses how the model generalizes these strategies.
Database: OpenAIRE