I understand what makes Bloom filters an attractive data structure; however, I'm finding it difficult to understand when you can actually use them, since you still have to perform the expensive operation you're trying to avoid in order to be certain that you haven't hit a false positive. Because of this, wouldn't they generally just add a lot of overhead? For example, the Wikipedia article on Bloom filters suggests they can be used for data synchronization. I see how that would be great the first time around, when the Bloom filter is empty, but say you haven't changed anything and you go to synchronize your data again. Now every lookup in the Bloom filter will report that the file has already been copied, but wouldn't we still have to perform the slower lookup we were trying to avoid to make sure that's actually correct?
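Basically, you use Bloom filters to avoid the long and arduous task of proving that an item doesn't exist in the data structure. It's almost always harder to determine that something is missing than that it exists, so the filter cuts the cost of searching for things you won't find anyway. A Bloom filter never produces false negatives, so a "no" from the filter is final and the expensive lookup can be skipped; only a "maybe" has to be confirmed with the slower check. That means it doesn't always save you the work (in your re-synchronization example, where every item is already present, it saves nothing), but when many of your queries are for items that aren't there, you reap a huge benefit.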
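To make the "skip the search for things that aren't there" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `BloomFilter` and `FilteredStore` classes, the bit-array size, the number of hash functions, and the use of an in-memory `set` as a stand-in for whatever slow lookup (disk, network, remote index) you are really trying to avoid.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (a real filter packs bits).
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


class FilteredStore:
    """Hypothetical store that consults the filter before the slow lookup."""

    def __init__(self, items):
        self._items = set(items)  # stand-in for an expensive backing store
        self._filter = BloomFilter()
        for item in self._items:
            self._filter.add(item)
        self.slow_lookups = 0  # counts how often the expensive path runs

    def _slow_contains(self, item):
        self.slow_lookups += 1
        return item in self._items

    def contains(self, item):
        if not self._filter.might_contain(item):
            return False  # definite miss: the expensive lookup is skipped
        # Possible hit: confirm with the slow lookup to rule out a false positive.
        return self._slow_contains(item)
```

Calling `contains` on a key that was never added will usually return `False` without touching `_slow_contains`, while a key that is present always pays for the slow lookup. The `slow_lookups` counter lets you see how much work the filter saved for a given mix of queries.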