Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining Bloom Filters

I am using bloom filters to check for duplicated data in a set. However, there is a need to combine the results of two sets of data into a single filter to check for duplication across the two sets. I devised a function in pseudo-Python to perform this task:

def combine(a : bloom_filter, b : bloom_filter):
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = new bloom_filter(length = a.length, hashes = b.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits

    # Determining the amount of items
    a_and_b = count(a & b)
    a_not_b = count(a & !b)
    not_a_b = count(!a & b)
    neither = count(!a & !b)
    c.item_count = a_not_b / a.length * a.item_count
                 + not_a_b / b.length * b.item_count
                 + a_and_b / c.length * min(a.item_count, b.item_count)

    return c

Does this even sound correct? I am having considerable internal debate as to whether is is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a bloom filter).

like image 901
Travis Gockel Avatar asked Sep 17 '25 13:09

Travis Gockel


1 Answers

You can derive a formula for estimating the amount of items a Bloom Filter:

c = log(z / N) / ((h * log(1 - 1 / N))

N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector

This provides a fairly accurate estimate of the number of items in the Bloom Filter. You can come up with an estimate for contribution with simple subtraction.

like image 120
Travis Gockel Avatar answered Sep 23 '25 14:09

Travis Gockel