Data Skew Profiling using HPCC Systems

Authors:	G. Shobha, Arjuna Chala, Jayanth S, Jyoti Shetty, Harsh Mishra, Dan Camper
Publication year:	2019
Subject:	business.industry; Computer science; Big data; Skew; Volume (computing); 02 engineering and technology; Data science; Set (abstract data type); Parallel processing (DSP implementation); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Profiling (information science); Table (database); sort; 020201 artificial intelligence & image processing; business
Source:	Proceedings of the 2019 International Conference on Big Data and Education.
DOI:	10.1145/3322134.3322142
Description:	Over the last few decades, the volume of data available for analysis has increased tremendously across many domains. Although processing power has also scaled up, it is well known that the growth of data far outpaces the gains in processing capability of modern processors. This era of large data volumes came to be known as the era of big data, and its natural consequence was the distribution of data across multiple nodes to facilitate not only storage but also parallel processing. Distributing data among machines poses a fundamental problem in big data and in distributed computing: the impact of data skew. We worked on a project to profile data skew on a multi-computing cluster, and this paper summarizes our efforts and findings. We use HPCC Systems, a modern big data management and analysis tool. In this project, we analyze the impact of differently skewed data distributions on the most common database operations, namely NORMALIZE, DENORMALIZE, JOIN, SORT, TABLE, and PROJECT, by running a set of queries and analyzing their runtimes.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f223991815cd26a6ef1f614853e1e7d0 https://doi.org/10.1145/3322134.3322142