
Which hash functions to use in a Bloom filter

I've got the following question about choosing hash functions for Bloom filters:

  • Which functions to use?

In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed.

I know what is meant by this (independent and uniformly distributed), but I'm having trouble finding an argument or discussion of which hash functions fulfill those requirements and are therefore suitable. Many posts suggest using the FNV or Murmur hash functions, but they do not explain why these are suitable (or at least give no proof).

Thanks in advance!

Torsten asked Aug 14 '12 14:08





2 Answers

I asked myself the same question when building a Java Bloom filter library. See the GitHub readme for a detailed treatment of my analysis of hash functions for Bloom filters.

I looked at the problem from two perspectives:

  • How fast is the computation?
  • How uniform is the output distribution?

Speed can easily be measured by benchmarks on random input. Uniformity is a bit harder and requires some statistics. Using chi-square goodness-of-fit tests, I measured how closely the distribution of hash values matches a uniform distribution.
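As a rough illustration of the uniformity check described above, here is a minimal Python sketch that buckets hash values and computes the chi-square statistic against a uniform expectation. The function names, the bucket count of 64, the sequential test keys, and the deliberately poor `length_hash` are all assumptions made for this example, not the library's actual benchmark. It only computes the statistic; turning it into a p-value would need a chi-square table or a statistics library.

```python
import hashlib


def bucket_counts(keys, hash_fn, num_buckets):
    """Count how many keys land in each of num_buckets buckets."""
    counts = [0] * num_buckets
    for key in keys:
        counts[hash_fn(key) % num_buckets] += 1
    return counts


def chi_square_statistic(counts):
    """Chi-square statistic of observed counts versus a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def sha256_hash(key: bytes) -> int:
    # Use the first 8 bytes of the SHA-256 digest as a 64-bit integer.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def length_hash(key: bytes) -> int:
    # Hypothetical, deliberately bad hash: only the key length matters.
    return len(key)


# Keys that change in small increments (a hard case for weak hashes).
keys = [str(i).encode() for i in range(10_000)]

good = chi_square_statistic(bucket_counts(keys, sha256_hash, 64))
bad = chi_square_statistic(bucket_counts(keys, length_hash, 64))
print(f"sha256: {good:.1f}  length_hash: {bad:.1f}")
```

For a uniform hash the statistic should be in the neighbourhood of the degrees of freedom (number of buckets minus one); a much larger value indicates a skewed distribution.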

The result is:

  • Use Murmur3 for the best trade-off between speed and uniformity. Do not use Murmur2 as it is not uniform for inputs that change in small increments.
  • Use a cryptographic hash function like SHA-256 for the best uniformity.
  • Apply the Kirsch-Mitzenmacher optimization to compute only 2 hash functions instead of k (hash_i = hash1 + i * hash2).
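To make the last point concrete, here is a minimal Python sketch of a Bloom filter that derives all k bit positions from two base hashes, following hash_i = hash1 + i * hash2. It is not the Java library's code: the `BloomFilter` class, its parameter names, and the choice to take hash1 and hash2 from two halves of a single SHA-256 digest are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter using double hashing (illustrative sketch only)."""

    def __init__(self, num_bits: int, num_hashes: int):
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Assumption: hash1 and hash2 are two 64-bit slices of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            # hash_i = hash1 + i * hash2, reduced to a bit index.
            yield (h1 + i * h2) % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add(b"x")
print(b"x" in bf)
```

Only one digest is computed per item regardless of k, which is the point of the optimization. Note that if hash2 happens to be a multiple of the bit count, all k positions coincide; a production implementation may want to guard against that.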

If your implementation uses Java, I would recommend our Bloom filter hash library. It is well documented and thoroughly tested. For details, including benchmark results for the different hash functions and their uniformity according to the chi-square test, see the GitHub readme of the repo.

DivineTraube answered Oct 17 '22 06:10



Hash Functions should provide you with graphical proof of why FNV would be a bad choice, and why Murmur2 or one of Bob Jenkins' Hashes would be a good choice.

Guy Gordon answered Oct 17 '22 04:10
