
Improving the distribution of hash function values

Tags:

hash

bigdata

Suppose I have a very large number of strings (say 10 billion strings of ~50 characters each). I want to distribute the strings into exactly 10 buckets. Each bucket should hold about 10% of the strings. With a hash function h() I can do:

int bucket_for_s = h(s) % 10

However, this provides no guarantee about the evenness of the distribution. Suppose I do this for all strings and find that 30% go to bucket 1, 5% go to bucket 2, and so on. My question is:

Given h()'s distribution, is there a way to generate a new hash function h2() that will distribute the strings more evenly?

Alternatively, is there a process that can generate a series of hash functions h2(), h3(), ... such that (1) each hash function distributes the strings more evenly than the previous one, and (2) I only have to generate a reasonable number of hash functions?

I should also mention that unfortunately I can't simply split the input into 10 parts, because my input is spread across several machines. I am looking for a deterministic solution I can apply on each machine separately and get the same results (so eventually "hello" would go to bucket x, no matter on which machine it was stored).

asked Aug 24 '12 by user1424934

People also ask

How do you make a good hash function?

A good hash function to use with integer key values is the mid-square method. The mid-square method squares the key value, and then takes out the middle r bits of the result, giving a value in the range 0 to 2^r − 1. This works well because most or all bits of the key value contribute to the result.
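As a rough sketch in Java (the method name and the 64-bit intermediate are my own, illustrative choices):

static int midSquareHash(int key, int r) {
    long squared = (long) key * key;        // square in 64 bits to avoid overflow
    int shift = (64 - r) / 2;               // position of the middle r bits
    return (int) ((squared >>> shift) & ((1L << r) - 1)); // value in 0 .. 2^r - 1
}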

Are hashes uniformly distributed?

Yes, a good hash function should have uniformly distributed output values.

Are hashes randomly distributed?

Yes. Any hash function that exhibits uniformity gives every value in its output range an equal chance of being produced by a randomly chosen input.

What happens to a hash if you change the input values?

Once you create a hash, the only way to get the same exact hash is to input the same text. If you change even just one character, the hash value will change as well.


2 Answers

Cryptographically strong hash functions should already have a very even distribution across all bits of the hash output.

If you are using something like Java's hashCode(), which I believe looks like

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

you may well see a less than ideal hash distribution.

Try using a cryptographic hash such as SHA-256 as a basis.

Google's CityHash is less well distributed than SHA-256, but it is much faster and may provide sufficient distribution at less computational expense.
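A minimal sketch of the SHA-256 approach in Java (the class and method names are my own, for illustration): derive the bucket from the first four bytes of the digest, so every machine maps the same string to the same bucket.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShaBucket {
    // Deterministically map a string to one of numBuckets buckets
    // using the first four bytes of its SHA-256 digest.
    static int bucketFor(String s, int numBuckets) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
        int h = ((digest[0] & 0xFF) << 24)
              | ((digest[1] & 0xFF) << 16)
              | ((digest[2] & 0xFF) << 8)
              |  (digest[3] & 0xFF);
        return Math.floorMod(h, numBuckets); // floorMod avoids negative bucket numbers
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(bucketFor("hello", 10)); // same bucket on every machine
    }
}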

answered Sep 25 '22 by Eric J.


Chaining hash functions or generating a series of hash functions would be unnecessarily computationally expensive. You should rather use a hash function that already has the required properties out of the box.

Possible candidates

From what you described, the hash function must be deterministic (your "hello" example) - which is true of all hash functions - and it should produce an even distribution.

A cryptographic hash such as SHA-256 should meet your requirements, as it outputs completely different hashes even for slightly different inputs like "hello" and "hallo". By applying the modulo (%) operation to the hash, you can then have as many buckets as you like (though of course no more than the number of possible hash values).

However, cryptographic hash functions are built for security and checksums and involve some complex computation. In your case, it is very likely that you will not need the strong security-related properties they provide.

You may rather want to look at so-called "non-cryptographic hash functions", which relax those properties and are designed for lookups - in other words, they are optimized for speed. Java's hashCode(), MurmurHash and the already mentioned CityHash (Google announcement) might be a good start.
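For instance, a minimal bucketing sketch with Java's built-in, non-cryptographic hashCode() (a MurmurHash or CityHash library implementation could be swapped in for better distribution):

static int bucketFor(String s, int numBuckets) {
    // hashCode() may return a negative value, so use floorMod instead of %
    return Math.floorMod(s.hashCode(), numBuckets);
}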

Deterministic nature of hash functions vs. even distribution of hashes

That said, since hash functions are deterministic with respect to their input, the hash for a given input such as "hello" will always be the same, even if you call the hash function multiple times. If your data set contains elements with many exact duplicates (e.g. "a" and "the" are the usual suspects in tokenized text), this can easily lead to unevenly sized buckets, no matter which hash function you use.

Assuming you want to use the even distribution of hashes to distribute workload evenly, this can be overcome with the following strategy. Think of each bucket as a work package or job that can be processed by any of the available machines. If you have more work packages than machines (say, 20 or 30 packages for 10 machines), you can distribute the workload evenly as long as you allow for flexible scheduling. When machine A gets one of the oversized packages and takes a while to process it, machine B can process two small or medium-sized packages in the same time, which reduces the overall performance impact of the oversized package.
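A minimal sketch of that strategy in Java, simulating each machine as a thread that pulls the next package from a shared queue (processPackage is a hypothetical placeholder for the real work):

import java.util.concurrent.ConcurrentLinkedQueue;

public class PackageQueueSketch {
    // Placeholder: process all strings whose hash falls into package pkg.
    static void processPackage(int pkg) {
    }

    public static void main(String[] args) throws InterruptedException {
        int numPackages = 30, numMachines = 10;
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int p = 0; p < numPackages; p++) queue.add(p);

        // Each "machine" (here: a thread) pulls the next package as soon
        // as it is idle, so an oversized package only delays one worker.
        Thread[] machines = new Thread[numMachines];
        for (int m = 0; m < numMachines; m++) {
            machines[m] = new Thread(() -> {
                Integer pkg;
                while ((pkg = queue.poll()) != null) {
                    processPackage(pkg);
                }
            });
            machines[m].start();
        }
        for (Thread t : machines) {
            t.join();
        }
    }
}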

answered Sep 23 '22 by cyroxx