Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bloom filter usage

I am struggling to understand the usefulness of the bloom filter. I get its underlying logic, space compaction, fast lookups, false positives etc. I just cannot put that concept into a real-life situation as being beneficial. One frequent application is use of bloom filters in web caching. We use bloom filter to determine whether a given URL is in the cache or not. Why don't we simply access the cache to determine that? If we get a yes, we still need to go to cache to retrieve the webpage (which might not be there), but in case of a no, we could have got the same answer using the cache (which is probably optimized for fast lookups anyway?).

like image 384
Bober02 Avatar asked Jan 18 '13 16:01

Bober02


People also ask

When should I use Bloom filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It is used where we just need to know the element belongs to the object or not.

What is a Bloom filter how it is used?

A Bloom filter, named after its inventor Burton Howard Bloom, is a data structure that can be used to perform a cheap test for the potential presence of a particular value, in a way that is much faster than looking up the value in an index, requiring much less storage than the index would.

How is Bloom filter used in Cassandra?

Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. Cassandra uses Bloom filters to determine whether an SSTable has data for a particular partition. Bloom filters are unused for range scans, but are used for index scans.

How is Bloom filter space efficient?

The advantage of a Bloom filter over the established dictionary structures is space efficiency. A Bloom filter needs only a constant number of bits per (prospective) element, while keeping the FPR constant, independent of the size of the elements' universe.


1 Answers

Bloom filters are designed for situations where a false negative is a Very Bad Thing and a false positive is acceptable.

For example, suppose that you are making a web browser and have a known blacklist of scam websites. Your blacklist is massive - in the hundreds of gigabytes - so you can't ship it with the browser. However, you can store it on your own servers. In that case, you could ship the browser with a Bloom filter of an appropriate size that holds all the URLs. Before visiting a site, you look it up in the filter. Then, if you get a "no" answer, you're guaranteed that the URL is not blacklisted and can just visit the site. If you get a "yes" answer, the site might be evil, so you can have the browser call up your main server to get the real answer. The fact that you can save a huge number of calls to the server here without ever sacrificing accuracy is important.

The cache idea is similar to this setup. You can query the filter to see if the page is in the cache. If you get a "no" answer, you're guaranteed it's not cached and can do an expensive operation to pull the data from the main source. Otherwise, you can then check the cache to see if it really is there. In rare instances you might need to check the cache, see that it isn't there, then pull from the main source, but you will never accidentally miss something really in cache.

Hope this helps!

like image 55
templatetypedef Avatar answered Sep 30 '22 03:09

templatetypedef