
Bloom filters and their multiple hash functions

I'm implementing a simple Bloom Filter as an exercise.

Bloom filters require multiple hash functions, which for practical purposes I don't have.

Assuming I want 3 hash functions, isn't it enough to hash the object I'm checking membership for (with murmur3), add 1, 2, and 3 to that hash (one offset per hash function), and then hash each of those values again?

Since murmur3 has a very good avalanche effect (it really spreads out results), wouldn't this be reasonable for all practical purposes?

Pseudo-code:

function generateHashes(obj) {
  long hash = murmur3_hash(obj);        // hash the object once
  long hash1 = murmur3_hash(hash + 1);  // re-hash with a distinct offset per function
  long hash2 = murmur3_hash(hash + 2);
  long hash3 = murmur3_hash(hash + 3);
  return (hash1, hash2, hash3);
}

If not, what would be a simple, useful approach to this? I'd like a solution that scales easily to more hash functions if need be.

Thanks

devoured elysium asked Feb 11 '18 00:02





2 Answers

AFAIK, the usual approach is not to use multiple hash functions at all. Rather, hash once and split the resulting hash into 2, 3, or however many parts you want for your Bloom filter. For example, create a 128-bit hash and split it into two 64-bit hashes.

https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
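A minimal sketch of that hash-once-and-split idea in Python. Python's standard library has no murmur3, so BLAKE2b with a 16-byte digest is used here purely as a stand-in for a 128-bit hash; `split_hash` is a hypothetical helper name, not something from the linked page.

```python
import hashlib
from typing import Tuple


def split_hash(data: bytes) -> Tuple[int, int]:
    """Hash once to 128 bits, then split the digest into two 64-bit values."""
    # Assumption: BLAKE2b (16-byte digest) stands in for a 128-bit murmur3.
    digest = hashlib.blake2b(data, digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")   # first 64 bits
    h2 = int.from_bytes(digest[8:], "little")   # last 64 bits
    return h1, h2
```

Each half would then be reduced modulo the size of the bit array to get a position.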

memo answered Oct 14 '22 18:10



The hash functions of a Bloom filter should be independent and random enough, and murmur hash is great for this purpose. So your approach is correct, and you can generate as many new hashes as you like that way. For educational purposes it is fine.
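For reference, the scheme from the question generalized to k hashes might look roughly like this in Python. This is only a sketch: BLAKE2b with an 8-byte digest stands in for `murmur3_hash` (the standard library has no murmur3), and `_hash64` and `generate_hashes` are made-up names.

```python
import hashlib
from typing import List

MASK64 = (1 << 64) - 1


def _hash64(data: bytes) -> int:
    # Assumption: BLAKE2b (8-byte digest) stands in for murmur3_hash.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def generate_hashes(obj: bytes, k: int = 3) -> List[int]:
    """Hash the object once, then re-hash (base + i) for i = 1..k."""
    base = _hash64(obj)
    return [
        # Wrap to 64 bits so the offset value always fits in 8 bytes.
        _hash64(((base + i) & MASK64).to_bytes(8, "little"))
        for i in range(1, k + 1)
    ]
```

Note that this runs the hash function k + 1 times per item.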

But in the real world, running a hash function many times per item is costly, so the usual approach is to derive the additional hashes from values you already have, without running the hash function again for each one.

To correct @memo, this is not done by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to get two independent hashes and combine them.

function generateHashes(obj) {
  // initialization phase: only two real hash computations
  long h1 = murmur3_hash(obj);
  long h2 = murmur3_hash(h1);

  int k = 3; // number of desired hash functions
  long hash[k];

  // generation phase: derive each hash as h1 + i*h2
  for (int i=0; i<k; i++) {
      hash[i] = h1 + (i*h2);
  }

  return hash;
}

As you can see, creating each new hash this way is a single multiply-add operation.
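To show how the `h1 + i*h2` combination plugs into an actual filter, here is a small Python sketch. It rests on a few assumptions that are not in the pseudo-code above: BLAKE2b with an 8-byte digest stands in for `murmur3_hash`, each combined value is reduced modulo the bit-array size `m`, and the `BloomFilter` class and its method names are invented for illustration.

```python
import hashlib
from typing import Iterator


def _hash64(data: bytes) -> int:
    # Assumption: BLAKE2b (8-byte digest) stands in for murmur3_hash.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                              # number of bits
        self.k = k                              # number of derived hashes
        self.bits = bytearray((m + 7) // 8)     # bit array, all zeros

    def _indexes(self, item: bytes) -> Iterator[int]:
        # Initialization phase: two real hash computations.
        h1 = _hash64(item)
        h2 = _hash64(h1.to_bytes(8, "little"))
        # Generation phase: one multiply-add per derived hash,
        # reduced modulo m to get a bit position (assumption).
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: bytes) -> None:
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indexes(item))
```

Usage would be along the lines of `bf = BloomFilter(1024, 3); bf.add(b"apple"); b"apple" in bf`. Scaling to more hash functions only means raising `k`; the number of real hash computations per item stays at two.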

igrinis answered Oct 14 '22 17:10
