Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing the approximate population of a bloom filter

Given a bloom filter of size N-bits and K hash functions, of which M-bits (where M <= N) of the filter are set.

Is it possible to approximate the number of elements inserted into the bloom filter?

Simple Example

I've been mulling over the following example, assuming a BF of 100-bits and 5 hash functions where 10-bits are set...

Best case scenario: Assuming the hash functions are really perfect and uniquely map a bit for some X number of values, then given 10-bits have been set we can say that there has only been 2 elements inserted into the BF

Worst case scenario: Assuming the hash functions are bad and consistently map to the same bit (yet unique amongst each other), then we can say 10 elements have been inserted into the BF

The range seems to be [2,10] where abouts in this range is probably determined by the false-positive probability of filter - I'm stuck at this point.

like image 275
Xander Tulip Avatar asked Feb 01 '12 21:02

Xander Tulip


1 Answers

This question worries me a bit because there are better algorithms for approximately counting the number of distinct elements with small amounts of storage.

Nevertheless, if we must use a Bloom filter, let's assume that the hash functions are random oracles (all values chosen independently, or "really perfect", not to be confused with perfect hashing). Now we have a balls and bins problem: given that M of N bins have balls in them, how many balls did we throw? Let B be the number of balls thrown; the number of items is B/K, since for each item we throw K balls.

The standard approximation for balls and bins processes is to model each bin as an independent Poisson process; the time before a bin becomes occupied is exponentially distributed. Letting 1 be the time it takes to throw all of the balls, the maximum likelihood estimate λ of the rate of this exponential distribution satisfies Pr(Exponential[λ] < 1) = M/N, so 1 - exp(-λ) = M/N and λ = -log(1 - M/N). The parameter λ is akin to the number of balls, so the estimate for the number of items is B ≈ -N log(1 - M/N)/K.

EDIT: There are N bins, so we need to multiply by N.

like image 147
swen Avatar answered Sep 30 '22 09:09

swen