Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How many hash functions does my bloom filter need?

Tags:

Wikipedia says:

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

I read the article, but what I don't understand is how k is determined. Is it a function of the table size?

Also, in hash tables I've written I used a simple but effective algorithm for automatically growing the hash's size. Basically, if ever more than 50% of the buckets in the table were filled, I would double the size of the table. I suspect you might still want to do this with a bloom filter to reduce false positives. Correct ?

like image 953
dicroce Avatar asked Mar 18 '09 14:03

dicroce


People also ask

How many hash functions Bloom filter?

1, the Bloom filter is 32 bits per item (m/n = 32). At this point, 22 hash functions are used to minimize the false positive rate. However, adding hash functions does not significantly reduce the error rate when more than 10 hash functions have been used. Equation (2) is the basic formula of Bloom filter.

Why do Bloom filters use multiple hash functions?

The more hash functions you have, the more bits will be set in the Bloom filter when you create the filter (because each element you insert causes up to k bits to be set in the filter). The more bits that are set, the higher the risk of false positives.

Which hash function is used in Bloom filter?

Choice of Hash Function The hash function used in bloom filters should be independent and uniformly distributed. They should be fast as possible. Fast simple non cryptographic hashes which are independent enough include murmur, FNV series of hash functions and Jenkins hashes.

How do you find the optimal number of hash functions?

Find the optimal false positive rate and determine the number of hash functions. From Formula 2, the ideal number of hash functions should be k ≈ 0.693 ∗ 8 ∗ 106 ⁄ 106 = 5.544. Formula 3 tells us that the false positive rate is f ≈ (1⁄2)5.544 ≈ 0.0214, but we need a legal value of k.


2 Answers

Given:

  • n: how many items you expect to have in your filter (e.g. 216,553)
  • p: your acceptable false positive rate {0..1} (e.g. 0.01 → 1%)

we want to calculate:

  • m: the number of bits needed in the bloom filter
  • k: the number of hash functions we should apply

The formulas:

m = -n*ln(p) / (ln(2)^2) the number of bits
k = m/n * ln(2) the number of hash functions

In our case:

  • m = -216553*ln(0.01) / (ln(2)^2) = 997263 / 0.48045 = 2,075,686 bits (253 kB)
  • k = m/n * ln(2) = 2075686/216553 * 0.693147 = 6.46 hash functions (7 hash functions)

Note: Any code released into public domain. No attribution required.

like image 96
Ian Boyd Avatar answered Oct 01 '22 16:10

Ian Boyd


If you read further down in the Wikipedia article about Bloom filters, then you find a section Probability of false positives. This section explains how the number of hash functions influences the probabilities of false positives and gives you the formula to determine k from the desired expected prob. of false positives.


Quote from the Wikipedia article:

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

formula

like image 20
f3lix Avatar answered Oct 01 '22 17:10

f3lix