I was using BloomFilter in guava v.11.0.1 and it seems like I am getting an exception when my insertion is large. I tried at 10 million with 0.001 fpp, and it failed.
java.lang.IllegalArgumentException: Number of bits must be positive
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at com.google.common.hash.BloomFilterStrategies.checkPositiveAndMakeMultipleOf64(BloomFilterStrategies.java:72)
at com.google.common.hash.BloomFilterStrategies.access$000(BloomFilterStrategies.java:18)
at com.google.common.hash.BloomFilterStrategies$From128ToN.withBits(BloomFilterStrategies.java:37)
at com.google.common.hash.BloomFilter.create(BloomFilter.java:192)
at com.ipg.collection.BloomFilterWritable.impl(BloomFilterWritable.java:43)
at com.ipg.collection.BloomFilterWritable.put(BloomFilterWritable.java:62)
at com.ipg.prophet.twitter.twitflow.archive.UnzipTweetsProcessAndUpload$ProcessorConsumer.process(UnzipTweetsProcessAndUpload.java:107)
at com.ipg.prophet.twitter.twitflow.archive.UnzipTweetsProcessAndUpload$ProcessorConsumer.run(UnzipTweetsProcessAndUpload.java:84)
at java.lang.Thread.run(Thread.java:662)
I think at least it should support that many insertions with such a high fpp, shouldn't it?
Bloom filters do not store the data item at all. As we have seen they use bit array which allow hash collision. Without hash collision, it would not be compact.
Bloom filters do not store the items themselves and they use less space than the lower theoretical limit required to store the data correctly, and therefore, they exhibit an error rate. They have false positives but they do not have false negatives, and the one-sidedness of this error can be turned to our benefit.
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The probability of a false positive – or false positive rate – of a Bloom filter is a function of the randomness of the values generated by the hash functions and of m, n, and k (n is the number of objects mapped into the Bloom filter).
Sorry about this, I'm the culprit :)
Hopefully we will be able to push the next version soon. Not the time to mention this, but there is an upside to this accident: it means we can definitely kill the current serial form of BF and its related supporting code (which was an accident itself), which I'm trying to fix for a month now - incidentally the fix to that also fixes this problem.
Edit: more information here (and in Louis' filed issue)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With