Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bloom filters in a distributed environment

I have a system consisting of a few application instances, written in Java. Requests to them are load balanced for high availability. Every second, hundreds of small chunks of data (each consisting of a few simple strings) are received by this "cluster", stored in the database, kept for a couple of days and then discarded. Apart from storing this data, the main requirement is to quickly determine if a given value is stored in the database or not. An appropriately indexed and partitioned database table seems suited to the problem and does it's job well, at least for now.

The problem is, about 80% of the searched values are not found because they are not in the database. Therefore I would like to speed things up a bit and make the search faster and less resource-intensive. A bloom filter would be the obvious choice, were it not for the fact that different application instances receive different parts of data and if each application instance has only a part of the data in it's bloom filter then those bloom filters are useless.

Do you have any suggestions/ideas on how to solve this problem?

like image 619
zgguy Avatar asked May 19 '18 06:05

zgguy


1 Answers

kept for a couple of days and then discarded

Bloom filter does not support deleting objects, only inserting.
If you have multiple bloom filters, you have to query them all to check if one of them contains the object you need.

Bloom filters can be effectively merged, if they have the same structure (the same size, the same hash function, etc).

You can use this Bloom filter: https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilter.java

And merge two filters like this:

BloomFilter merge(BloomFilter dstFilter, BloomFilter srcFilter) {
    OpenBitSet dst = dstFilter.bitset;
    OpenBitSet src = srcFilter.bitset;

    for (int i = 0; i < src.getPageCount(); ++i) {
        long[] dstBits = dst.getPage(i);
        long[] srcBits = src.getPage(i);

        for (int j = 0; j < srcBits.length; ++j) {
            dstBits[j] |= srcBits[j];
        }
        dst.setPage(i, dstBits);
    }
    return dstFilter;
}
like image 109
Viacheslav Shalamov Avatar answered Sep 23 '22 06:09

Viacheslav Shalamov