D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

Authors: He, Haiwu, Simonet, Anthony, Anjos, Julio, Saray, José-Francisco, Fedak, Gilles, Tang, Bing, Lu, Lu, Shi, Xuanhua, Jin, Hai, Moca, Mircea, Silaghi, Gheorghe, Ben Cheikh, Asma, Abbes, Heithem
Contributors: Computer Network Information Center [Beijing] (CNIC), Chinese Academy of Sciences [Beijing] (CAS), Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Algorithms and Software Architectures for Distributed and HPC Platforms (AVALON), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire de l'Informatique du Parallélisme (LIP), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Lyon (ENS Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Centre National de la Recherche Scientifique (CNRS), School of Computer Science and Engineering [Changsha], Hunan University of Science and Technology [Xiangtan], Huazhong University of Science and Technology [Wuhan] (HUST), Universitatea Babeş-Bolyai [Cluj-Napoca], Technologie de l'Information et de la Communication (UTIC), École Supérieure des Sciences et Technologies de Tunis, International Science & Technology Cooperation Program of China under grant No. 2015DFE12860, and NSFC under grant No. 61370104, by the French National Research Agency (MapReduce ANR-10-SEGI-001) and by the Chinese Academy of Sciences President’s International Fellowship Initiative (PIFI) 2015 Grant No. 2015VTB064., ANR-10-SEGI-0001, MapReduce, Traitement intensif de données à très grande échelle à l'aide du paradigme MapReduce sur des infrastructures de type cloud et hybrides (2010), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Centre National de la Recherche Scientifique (CNRS)-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École normale supérieure - Lyon (ENS Lyon), Université de Lyon-École normale supérieure - Lyon (ENS Lyon)-Centre National de la Recherche Scientifique (CNRS)-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL)
Language: English
Publication year: 2015
Subject:
Source: International Conference on Big Data Intelligence and Computing (DataCom 2015), Dec 2015, Chengdu, China
Description: International audience; Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, the technique has gained considerable attention from the scientific community for its applicability to large-scale parallel data analysis (including geographic data, high-energy physics, genomics, etc.). So far, MapReduce has mostly been designed for batch processing of bulk data. The ambition of D3-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model in order to: i) cope with distributed data sets, i.e., data sets that span multiple distributed infrastructures or are stored on networks of loosely connected devices; ii) cope with dynamic data sets, i.e., data sets that change over time or that may be incomplete or only partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype, which is based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D3-MapReduce environment.
Database: OpenAIRE