A distributed data management system to support large-scale data analysis
Author: | Tamer Z. Emara, Joshua Zhexue Huang |
---|---|
Year of publication: | 2019 |
Subject: | Scheme (programming language); Computer science; Data management; 05 social sciences; Big data; 020207 software engineering; 02 engineering and technology; Set (abstract data type); Hardware and Architecture; 0502 economics and business; Data file; 0202 electrical engineering, electronic engineering, information engineering; Key (cryptography); Data mining; 050203 business & management; Software; Information Systems; Block (data storage) |
Source: | Journal of Systems and Software. 148:105-115 |
ISSN: | 0164-1212 |
Description: | Distributed data management is a key technology for efficient processing and analysis of massive data in cluster-computing environments. In particular, where data volumes exceed system capabilities, big data files must be summarized by representative samples that have the same statistical properties as the whole dataset. This paper proposes a big data management system (BDMS) based on distributed random sample data blocks. It presents a high-level architecture design of the BDMS that extends current distributed file systems. The system offers block-level management functions such as statistically aware data partitioning, data block organization, and data block selection. The paper also presents a round-random partitioning scheme that represents a big dataset as a set of non-overlapping data blocks, each of which is a random sample of the whole dataset. Based on this scheme, two algorithms are introduced as an implementation strategy to convert the HDFS blocks of a big file into a set of random sample data blocks, which are also stored in HDFS. The experimental results show that the execution time of the partitioning operation is acceptable in real applications because the operation is performed only once on each input data file. |
Database: | OpenAIRE |
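
The description above mentions a round-random partitioning scheme but gives no algorithmic detail. The following is a minimal in-memory sketch of one plausible reading of that idea, not the authors' algorithm: each original block is shuffled, and its records are then dealt round-robin across the output blocks, so every output block receives a random share of every input block. The function name `round_random_partition`, its parameters, and the use of Python lists in place of HDFS blocks are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_random_partition(
    blocks: Sequence[Sequence[T]], num_out: int, seed: int = 0
) -> List[List[T]]:
    """Hypothetical sketch: turn input blocks into `num_out` non-overlapping
    output blocks, each drawing a random share of every input block."""
    if num_out <= 0:
        raise ValueError("num_out must be positive")
    rng = random.Random(seed)
    out: List[List[T]] = [[] for _ in range(num_out)]
    for block in blocks:
        # Randomize the order of records inside this input block.
        shuffled = list(block)
        rng.shuffle(shuffled)
        # Start dealing at a random output block so that leftovers from
        # unevenly sized blocks do not always land in the same outputs.
        offset = rng.randrange(num_out)
        for i, record in enumerate(shuffled):
            out[(offset + i) % num_out].append(record)
    return out


# Three stand-in "HDFS blocks" of 100 records each (illustrative data only).
blocks = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
rsp_blocks = round_random_partition(blocks, num_out=4)
```

In this sketch every record lands in exactly one output block, so the output blocks are non-overlapping and together cover the whole dataset. Whether each block is a statistically adequate random sample depends on assumptions the description does not state, such as how records are distributed across the original blocks.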