What are buckets in terms of hash functions?

Tags:

hash-function

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of hash functions. Without a computer science background, this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>, and I had never considered how this kind of data structure is put together.

The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:

First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they “randomize” hash-keys

What exactly are buckets in terms of a hash function? It sounds like buckets are array-like structures, and that the hash function is some kind of algorithm (or array-like-structure search) that produces the same bucket number every time? What is inside this metaphorical bucket?

I've always read that JavaScript objects, Ruby hashes, etc. don't guarantee order. In practice I've found that the keys' order doesn't change (actually, I think that with an older version of Mozilla's Rhino interpreter the JS object order DID change, but I can't be sure...).

Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?

Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...

asked May 03 '17 14:05

Zach Smith

2 Answers

When you hash a value, any useful hash function generally has a range that is smaller than its domain. That is, out of a large set of possible input values (for example, all possible combinations of letters), it outputs one of a much smaller set of values (for example, a number capped at a certain size). As a result, more than one input value can map to the same output value.

When this is the case, the output values are referred to as buckets.

Consider the function f(x) = x mod 2

This generates the following outputs:

1 => 1
2 => 0
3 => 1
4 => 0

In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
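As a small sketch of the mapping above (Python is used purely for illustration, and the names `f` and `buckets` are my own), grouping the inputs by their output value makes the two buckets visible:

```python
from collections import defaultdict


def f(x):
    # The example hash function from above: x mod 2.
    return x % 2


# Group each input under the bucket number that f assigns to it.
buckets = defaultdict(list)
for x in [1, 2, 3, 4]:
    buckets[f(x)].append(x)

print(dict(buckets))  # {1: [1, 3], 0: [2, 4]}
```

Each key of `buckets` is a bucket number, and each list holds the inputs that landed in that bucket.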

A good hash function fills all of these 'buckets' roughly equally, which enables faster searching. If you take the mod of any number, you get the bucket to look in, so you have to search through fewer values than if you searched the whole set of inputs, since each bucket holds only a fraction of them. In the ideal situation, the hash is fast to calculate and there is only one value in each bucket, so a lookup takes only as long as applying the hash function.

This is a simplified example, of course, but hopefully you get the idea.
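To make "fills the buckets equally" concrete, here is a rough sketch that counts how many keys land in each bucket for two candidate hash functions. The key set (multiples of 10) and the helper `bucket_sizes` are assumptions chosen only to make the difference obvious; they are not from the book.

```python
from collections import Counter

# 100 keys, all multiples of 10 (an assumed, deliberately awkward key set).
keys = list(range(0, 1000, 10))


def bucket_sizes(hash_fn, keys):
    # Count how many keys fall into each bucket number.
    return Counter(hash_fn(k) for k in keys)


# k % 10 sends every multiple of 10 to bucket 0: one huge bucket.
poor = bucket_sizes(lambda k: k % 10, keys)

# k % 7 spreads the same keys over 7 buckets of 14 or 15 keys each.
better = bucket_sizes(lambda k: k % 7, keys)

print(max(poor.values()))    # 100
print(max(better.values()))  # 15
```

With the first function a lookup may have to scan all 100 keys; with the second it scans at most 15, which is the practical reason an even spread matters.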

answered Oct 05 '22 07:10

Milney

The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:

it's relatively cheap to compute
it differs as much as possible between different objects.

Let's give a really artificial example to show what I mean with this and why/how hashes are usually used.

Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.

Let's also define a relatively cheap hash function as follows:

hash = number % 10

The idea is simple: just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 into the 2-bucket, etc.

Those buckets don't really exist as a data structure. They just make it easy to reason about the hash function.

Now that we have this cheap hash function, we can use it to reduce the cost of other things. For example, suppose we want to create a new data structure that enables cheap searching of numbers. Let's call this data structure a hashmap.

Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set ... etc.

If we then want to look up some number, we first calculate its hash value. Then we check the list/set corresponding to this hash, and compare only those "similar" numbers to find the exact number we want. This means we only had to do a cheap hash calculation and then run the expensive equality check against roughly 1/10th of the numbers.
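The lookup described above can be sketched roughly as follows. This is a toy, not how Ruby or any real runtime implements its hashes: the class name `TinyHashSet`, its methods, and the comparison counter are all made up for illustration, and each bucket is just a plain Python list.

```python
class TinyHashSet:
    """Illustrative only: a fixed number of buckets, each a plain list."""

    def __init__(self, num_buckets=10):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, number):
        # With 10 buckets this is exactly hash = number % 10.
        return self.buckets[number % len(self.buckets)]

    def add(self, number):
        bucket = self._bucket_for(number)
        if number not in bucket:
            bucket.append(number)

    def contains(self, number):
        # Returns (found, comparisons) so the cost of the lookup is visible.
        comparisons = 0
        for candidate in self._bucket_for(number):
            comparisons += 1  # stands in for the "expensive" equality check
            if candidate == number:
                return True, comparisons
        return False, comparisons


numbers = TinyHashSet()
for n in range(100):
    numbers.add(n)

found, comparisons = numbers.contains(95)
print(found, comparisons)  # True 10
```

Finding 95 among 100 stored numbers takes at most 10 equality checks here, because only the bucket for hash 5 is scanned instead of the whole collection.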

Note here that we use the hash function to define a new data structure. The hash itself isn't a data structure.

answered Oct 05 '22 08:10

Imus
