A two-level formal model for Big Data processing programs

Authors:	João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante
Contributors:	Universidade Federal do Rio de Janeiro (UFRJ), Laboratoire Franco-Mexicain d'Informatique et d'Automatique (LAFMIA), Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)-Centro de Investigacion y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV)-Université de Technologie de Compiègne (UTC)-Consejo Nacional de Ciencia y Tecnología [Mexico] (CONACYT)-Centre National de la Recherche Scientifique (CNRS), Universidade Federal do Rio Grande do Norte [Natal] (UFRN), CAPES, Brazil, Universidade Federal Rio Grande do Norte - CNRS, LIRIS
Publication year:	2022
Subject:	[INFO.INFO-DB] Computer Science [cs]/Databases [cs.DB]; spark; big data; petri net; 0202 electrical engineering, electronic engineering, information engineering; 020206 networking & telecommunications; 020207 software engineering; [INFO.INFO-SE] Computer Science [cs]/Software Engineering [cs.SE]; 02 engineering and technology; mutation; [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; monoid algebra; Software
Source:	Science of Computer Programming, Elsevier, In press
ISSN:	0167-6423
DOI:	10.1016/j.scico.2021.102764
Description:	This paper proposes a model for specifying data flow-based parallel data processing programs independently of any target Big Data processing framework. It focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs at two levels: a higher level representing the program's data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). The model, first defined for non-iterative data processing programs, is then extended to iterative ones. A general specification of such programs, as implemented by data flow-based parallel programming models, is essential given the democratization of iterative and greedy Big Data analytics algorithms, which call for revisiting parallel programming models so that they can express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam, and Apache Flink, and discusses how the model generalizes these strategies.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::8332474d71a71e1b132a94c4d61ec73a https://doi.org/10.1016/j.scico.2021.102764