Wikipedia says:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
I read the article, but what I don't understand is how k is determined. Is it a function of the table size?
Also, in hash tables I've written I used a simple but effective algorithm for automatically growing the hash's size. Basically, if ever more than 50% of the buckets in the table were filled, I would double the size of the table. I suspect you might still want to do this with a bloom filter to reduce false positives. Correct ?
1, the Bloom filter is 32 bits per item (m/n = 32). At this point, 22 hash functions are used to minimize the false positive rate. However, adding hash functions does not significantly reduce the error rate when more than 10 hash functions have been used. Equation (2) is the basic formula of Bloom filter.
The more hash functions you have, the more bits will be set in the Bloom filter when you create the filter (because each element you insert causes up to k bits to be set in the filter). The more bits that are set, the higher the risk of false positives.
Choice of Hash Function The hash function used in bloom filters should be independent and uniformly distributed. They should be fast as possible. Fast simple non cryptographic hashes which are independent enough include murmur, FNV series of hash functions and Jenkins hashes.
Find the optimal false positive rate and determine the number of hash functions. From Formula 2, the ideal number of hash functions should be k ≈ 0.693 ∗ 8 ∗ 106 ⁄ 106 = 5.544. Formula 3 tells us that the false positive rate is f ≈ (1⁄2)5.544 ≈ 0.0214, but we need a legal value of k.
Given:
n
: how many items you expect to have in your filter (e.g. 216,553)p
: your acceptable false positive rate {0..1} (e.g. 0.01
→ 1%)we want to calculate:
m
: the number of bits needed in the bloom filterk
: the number of hash functions we should applyThe formulas:
m = -n*ln(p) / (ln(2)^2)
the number of bitsk = m/n * ln(2)
the number of hash functions
In our case:
m
= -216553*ln(0.01) / (ln(2)^2)
= 997263 / 0.48045
= 2,075,686
bits (253 kB)k
= m/n * ln(2)
= 2075686/216553 * 0.693147
= 6.46
hash functions (7 hash functions)Note: Any code released into public domain. No attribution required.
If you read further down in the Wikipedia article about Bloom filters, then you find a section Probability of false positives. This section explains how the number of hash functions influences the probabilities of false positives and gives you the formula to determine k from the desired expected prob. of false positives.
Quote from the Wikipedia article:
Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With