A two-level formal model for Big Data processing programs
Authors: | João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante |
---|---|
Contributors: | Universidade Federal do Rio de Janeiro (UFRJ), Laboratoire Franco-Mexicain d'Informatique et d'Automatique (LAFMIA), Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centro de Investigacion y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV)-Université de Technologie de Compiègne (UTC)-Consejo Nacional de Ciencia y Tecnología [Mexico] (CONACYT)-Centre National de la Recherche Scientifique (CNRS), Universidade Federal do Rio Grande do Norte [Natal] (UFRN), CAPES, Brésil, Universidade Federal Rio Grande do Norte - CNRS, LIRIS |
Publication year: | 2022 |
Subject: |
[INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB]; [INFO.INFO-SE]Computer Science [cs]/Software Engineering [cs.SE]; [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
Spark; big data; Petri net; monoid algebra; mutation; Software; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020206 networking & telecommunications; 020207 software engineering |
Source: | Science of Computer Programming, Elsevier, In press |
ISSN: | 0167-6423 |
DOI: | 10.1016/j.scico.2021.102764 |
Description: | This paper proposes a model for specifying data flow-based parallel data processing programs independently of the target Big Data processing framework. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs in two levels: a higher level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model to cover iterative data processing programs. A general specification of such programs, as implemented by data flow-based parallel programming models, is essential given the democratization of iterative and greedy Big Data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam, and Apache Flink, and discusses how the model generalizes these strategies. |
Database: | OpenAIRE |