Index-based join operations in Hive

Authors:	Thiruvengadam Radhakrishnan, Nematollaah Shiri, Mahsa Mofidpoor
Publication year:	2013
Subject:	Speedup; Database; Computer science; business.industry; Online analytical processing; Big data; Search engine indexing; Process (computing); InformationSystems_DATABASEMANAGEMENT; computer.software_genre; Scalability; Benchmark (computing); Join (sigma algebra); Data mining; business; computer
Source:	BigData Conference
DOI:	10.1109/bigdata.2013.6691768
Description:	Indexing techniques are crucial for the efficiency and scalability of query processing over big data. Hive is a batch-oriented big data management engine that is well suited to OLAP and data analysis applications. For very “selective” queries, whose output sizes are a small fraction of the contributing data, the brute-force approach performs poorly because of redundant disk I/O or the initiation of extra map operations. As a first attempt, we propose an index-based join technique to speed up the process and integrate it into Hive by mapping our design to the conceptual optimization flow. To evaluate performance, we create and run test queries on datasets generated using the TPC-H benchmark. Our results indicate a significant performance gain on relatively large data and/or highly selective queries that have a two-way join and a single join condition.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::4988a58db14ba943f4e317e0e1cf452e https://doi.org/10.1109/bigdata.2013.6691768