Improving I/O Efficiency in Hadoop-Based Massive Data Analysis Programs

Author:	Young-Kyoon Suh, Kyong-Ha Lee, Woo Lam Kang
Language:	English
Publication year:	2018
Subject:	Computer science; business.industry; Data layout; Conventional analysis; Big data; Search engine indexing; InformationSystems_DATABASEMANAGEMENT; 020206 networking & telecommunications; 02 engineering and technology; Parallel computing; Software_PROGRAMMINGTECHNIQUES; Computer Science Applications; QA76.75-76.765; Parallel processing (DSP implementation); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Selection (linguistics); Data_FILES; Computer software; InformationSystems_MISCELLANEOUS; Inefficiency; business; Software
Source:	Scientific Programming, Vol 2018 (2018)
ISSN:	1058-9244
Description:	Apache Hadoop has been a popular parallel processing tool in the era of big data. While practitioners have rewritten many conventional analysis algorithms to tailor them to Hadoop, inefficient I/O in Hadoop-based programs has been repeatedly reported in the literature. In this article, we address the problem of I/O inefficiency in Hadoop-based massive data analysis by introducing our efficient modification of Hadoop. We first incorporate a columnar data layout into the conventional Hadoop framework, without any modification of the Hadoop internals. We also provide Hadoop with an indexing capability that saves a large amount of I/O when processing not only selection predicates but also star-join queries, which are often used in analysis tasks.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::38e6329e33032766224b014ffbcd3981 https://doaj.org/article/a54558e381844bdfb7dd5bf19cd6d3f0