Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Guava Bloom Filter does not support large insertions?

I was using BloomFilter in guava v.11.0.1 and it seems like I am getting an exception when my insertion is large. I tried at 10 million with 0.001 fpp, and it failed.

java.lang.IllegalArgumentException: Number of bits must be positive
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
    at com.google.common.hash.BloomFilterStrategies.checkPositiveAndMakeMultipleOf64(BloomFilterStrategies.java:72)
    at com.google.common.hash.BloomFilterStrategies.access$000(BloomFilterStrategies.java:18)
    at com.google.common.hash.BloomFilterStrategies$From128ToN.withBits(BloomFilterStrategies.java:37)
    at com.google.common.hash.BloomFilter.create(BloomFilter.java:192)
    at com.ipg.collection.BloomFilterWritable.impl(BloomFilterWritable.java:43)
    at com.ipg.collection.BloomFilterWritable.put(BloomFilterWritable.java:62)
    at com.ipg.prophet.twitter.twitflow.archive.UnzipTweetsProcessAndUpload$ProcessorConsumer.process(UnzipTweetsProcessAndUpload.java:107)
    at com.ipg.prophet.twitter.twitflow.archive.UnzipTweetsProcessAndUpload$ProcessorConsumer.run(UnzipTweetsProcessAndUpload.java:84)
    at java.lang.Thread.run(Thread.java:662)

I think at least it should support that many insertions with such a high fpp, shouldn't it?

like image 288
krispyjala Avatar asked Feb 02 '12 00:02

krispyjala


People also ask

What is not addressed by Bloom filter?

Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact.

Do Bloom filters have false negatives?

Bloom filters do not store the items themselves and they use less space than the lower theoretical limit required to store the data correctly, and therefore, they exhibit an error rate. They have false positives but they do not have false negatives, and the one-sidedness of this error can be turned to our benefit.

What does Bloom filter Tell us about an item?

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

What is a Bloom filter false positive?

The probability of a false positive – or false positive rate – of a Bloom filter is a function of the randomness of the values generated by the hash functions and of m, n, and k (n is the number of objects mapped into the Bloom filter).


1 Answers

Sorry about this, I'm the culprit :)

Hopefully we will be able to push the next version soon. Not the time to mention this, but there is an upside to this accident: it means we can definitely kill the current serial form of BF and its related supporting code (which was an accident itself), which I'm trying to fix for a month now - incidentally the fix to that also fixes this problem.

Edit: more information here (and in Louis' filed issue)

like image 58
Dimitris Andreou Avatar answered Sep 28 '22 09:09

Dimitris Andreou