A seven-dimensional analysis of hashing methods and its implications on query processing
Autor: | Stefan Richter, Victor Alvarez, Jens Dittrich |
---|---|
Rok vydání: | 2015 |
Předmět: |
Theoretical computer science
Universal hashing Computer science Linear probing Dynamic perfect hashing Hash function General Engineering Linear hashing 2-choice hashing Hash table Locality-sensitive hashing Hopscotch hashing K-independent hashing Cuckoo hashing Open addressing Pearson hashing SUHA Locality preserving hashing Feature hashing Consistent hashing Extendible hashing Double hashing Tabulation hashing |
Zdroj: | Proceedings of the VLDB Endowment. 9:96-107 |
ISSN: | 2150-8097 |
DOI: | 10.14778/2850583.2850585 |
Popis: | Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (which includes an improved variant of Robin Hood hashing), (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method. |
Databáze: | OpenAIRE |
Externí odkaz: |