Parallelization of Algorithms for Mining Data from Distributed Sources

Autor: Maria Efimova, Ivan Kholod, Sergei Gorlatch, Andrey Shorov
Rok vydání: 2019
Předmět:
Zdroj: Lecture Notes in Computer Science ISBN: 9783030256357
PaCT
DOI: 10.1007/978-3-030-25636-4_23
Popis: We suggest an approach to optimize data mining in modern applications that work on distributed data. We formally transform a high-level functional representation of a data-mining algorithm into a parallel implementation that performs as much as possible computations locally at the data sources, rather than accumulating all data for processing at a central location as in the traditional MapReduce approach. Our approach avoids the main disadvantages of the state-of-the-art MapReduce frameworks in the context of distributed data: increased run time, high network traffic, and an unauthorized access to data. We use the popular data-mining algorithm – Naive Bayes – for illustrating our approach and evaluating it experimentally. Our experiments confirm that the implementation of Naive Bayes developed by using our approach significantly outperforms the traditional MapReduce-based implementation regarding the run time and the network traffic.
Databáze: OpenAIRE