
Improving the distribution of hash function values

Tags:

hash

bigdata

Suppose I have a very large number of strings (say 10 billion strings of ~50 characters each). I want to distribute the strings into exactly 10 buckets. Each bucket should hold about 10% of the strings. With a hash function h() I can do:

int bucket_for_s = h(s) % 10

However, this provides no guarantee about the evenness of the distribution. Suppose I do this for all strings and find that 30% go to bucket 1, 5% go to bucket 2, and so on. My question is:

Given h()'s distribution, is there a way to generate a new hash function h2() that will distribute the strings more evenly?

Alternatively, is there a process that can generate a series of hash functions h2(), h3(), ... such that (1) each hash function distributes the strings more evenly than the previous one, and (2) I only have to generate a reasonable number of hash functions?

I should also mention that unfortunately I can't simply split the input into 10 parts, because my input is spread across several machines. I am looking for a deterministic solution I can apply on each machine separately and get the same results (so eventually "hello" would go to bucket x, no matter on which machine it was stored).

asked Aug 24 '12 by user1424934

People also ask

How do you make a good hash function?

A good hash function to use with integer key values is the mid-square method. The mid-square method squares the key value, and then takes out the middle r bits of the result, giving a value in the range 0 to 2^r − 1. This works well because most or all bits of the key value contribute to the result.
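As a rough sketch in Java (the method name and the 64-bit intermediate are my own, illustrative choices):

static int midSquareHash(int key, int r) {
    long squared = (long) key * key;        // square in 64 bits to avoid overflow
    int shift = (64 - r) / 2;               // position of the middle r bits
    return (int) ((squared >>> shift) & ((1L << r) - 1)); // value in 0 .. 2^r - 1
}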

Are hashes uniformly distributed?

Yes, a good hash function should have uniformly distributed output values.

Are hashes randomly distributed?

Yes. Any hash function that exhibits uniformity gives every value in its output range an equal chance of being produced by a randomly chosen input.

What happens to a hash if you change the input values?

Once you create a hash, the only way to get the same exact hash is to input the same text. If you change even just one character, the hash value will change as well.


2 Answers

Cryptographically strong hash functions should already have a very even distribution across all bits of the hash output.

If you are using something like Java's hashCode(), which I believe looks like

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

you may well see a less than ideal hash distribution.

Try using a cryptographic hash such as SHA-256 as a basis.

Google's CityHash is less well distributed than SHA-256, but it is much faster and may provide sufficient distribution at less computational expense.
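A minimal sketch of the SHA-256 approach in Java (the class and method names are my own, for illustration): derive the bucket from the first four bytes of the digest, so every machine maps the same string to the same bucket.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShaBucket {
    // Deterministically map a string to one of numBuckets buckets
    // using the first four bytes of its SHA-256 digest.
    static int bucketFor(String s, int numBuckets) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
        int h = ((digest[0] & 0xFF) << 24)
              | ((digest[1] & 0xFF) << 16)
              | ((digest[2] & 0xFF) << 8)
              |  (digest[3] & 0xFF);
        return Math.floorMod(h, numBuckets); // floorMod avoids negative bucket numbers
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(bucketFor("hello", 10)); // same bucket on every machine
    }
}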

answered Sep 25 '22 by Eric J.


Chaining hash functions or generating a series of hash functions would be unnecessarily computationally expensive. You should rather use a hash function that already has the required properties out of the box.

Possible candidates

From what you described, the hash function must be deterministic (your "hello" example) - which is true of all hash functions - and it should produce an even distribution.

A cryptographic hash such as SHA-256 should meet your requirements, as it outputs completely different hashes even for slightly different inputs like "hello" and "hallo". By applying the modulo (%) operation to the hash, you can then have as many buckets as you like (though of course no more than the number of possible hash values).

However, cryptographic hash functions are built for security and checksums and involve some complex computation. In your case, it is very likely that you will not need the strong security-related properties they provide.

You may rather want to look at so-called "non-cryptographic hash functions", which relax those properties and are designed for lookups - in other words, they are optimized for speed. Java's hashCode(), MurmurHash and the already mentioned CityHash (Google announcement) might be a good start.
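For instance, a minimal bucketing sketch with Java's built-in, non-cryptographic hashCode() (a MurmurHash or CityHash library implementation could be swapped in for better distribution):

static int bucketFor(String s, int numBuckets) {
    // hashCode() may return a negative value, so use floorMod instead of %
    return Math.floorMod(s.hashCode(), numBuckets);
}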

Deterministic nature of hash functions vs. even distribution of hashes

That said, since hash functions are deterministic with respect to their input, the hash for a given input such as "hello" will always be the same, even if you call the hash function multiple times. If your data set contains elements with many exact duplicates (e.g. "a" and "the" are the usual suspects in tokenized text), this can easily lead to unevenly sized buckets, no matter which hash function you use.

Assuming you want to use the even distribution of hashes to distribute workload evenly, this can be overcome with the following strategy. Think of each bucket as a work package or job that can be processed by any of the available machines. If you have more work packages than machines (say, 20 or 30 packages for 10 machines), you can distribute the workload evenly as long as you allow for flexible scheduling. When machine A gets one of the oversized packages and takes a while to process it, machine B can process two small or medium-sized packages in the same time, which reduces the overall performance impact of the oversized package.
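A minimal sketch of that strategy in Java, simulating each machine as a thread that pulls the next package from a shared queue (processPackage is a hypothetical placeholder for the real work):

import java.util.concurrent.ConcurrentLinkedQueue;

public class PackageQueueSketch {
    // Placeholder: process all strings whose hash falls into package pkg.
    static void processPackage(int pkg) {
    }

    public static void main(String[] args) throws InterruptedException {
        int numPackages = 30, numMachines = 10;
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int p = 0; p < numPackages; p++) queue.add(p);

        // Each "machine" (here: a thread) pulls the next package as soon
        // as it is idle, so an oversized package only delays one worker.
        Thread[] machines = new Thread[numMachines];
        for (int m = 0; m < numMachines; m++) {
            machines[m] = new Thread(() -> {
                Integer pkg;
                while ((pkg = queue.poll()) != null) {
                    processPackage(pkg);
                }
            });
            machines[m].start();
        }
        for (Thread t : machines) {
            t.join();
        }
    }
}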

answered Sep 23 '22 by cyroxx