
Bloom filter or cuckoo hashing?

Which do you prefer and why?

They both can be used to accomplish similar tasks, but I'm curious to see what people have used in actual applications and their reasoning for doing so.

J B asked May 15 '09


People also ask

Is a Bloom filter a hash table?

In a hash table, the object itself is stored in the bucket (index position in the table) that the hash function maps it to. A Bloom filter doesn't store the associated object at all; it only tells you whether the object is possibly in the filter or definitely not. As a result, hash tables are less space efficient.

What is a Bloom filter hash?

A Bloom filter is a probabilistic data structure based on hashing. It is extremely space efficient and is typically used to add elements to a set and test whether an element is in the set. The elements themselves are not stored, though; instead, hashes of the elements are added to the set.
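
To make that concrete, here is a rough Java sketch of the idea (not a production implementation; the sizing and double-hashing scheme are just illustrative). Only bit positions derived from hashes of the items are stored, never the items themselves:

    import java.util.BitSet;

    // Minimal Bloom filter sketch: k bit positions are derived from each item's
    // hash and set in a bit array; the item itself is never stored.
    class TinyBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        TinyBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // Derive the i-th bit position from two cheap base hashes (double hashing).
        private int index(String item, int i) {
            int h1 = item.hashCode();
            int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String item) {
            for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
        }

        // true  = "maybe present"          (false positives possible)
        // false = "definitely not present" (no false negatives)
        boolean mightContain(String item) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(item, i))) return false;
            }
            return true;
        }
    }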

What is Bloom filter good for?

A Bloom filter can be used to speed up answers in a key-value storage system where the values are stored on disk, which has slow access times. Bloom filter decisions are much faster. However, some unnecessary disk accesses are still made when the filter reports a positive (in order to weed out the false positives).
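
As a sketch of that pattern (the store and disk methods below are hypothetical placeholders, reusing the TinyBloomFilter from above):

    // Consult the in-memory filter before the slow disk lookup; a negative
    // answer skips the disk entirely, a positive still has to be verified.
    class FilteredStore {
        private final TinyBloomFilter keysOnDisk = new TinyBloomFilter(1 << 20, 4);

        void put(String key, String value) {
            keysOnDisk.add(key);
            writeToDisk(key, value);
        }

        String get(String key) {
            if (!keysOnDisk.mightContain(key)) {
                return null;              // definitely not stored: no disk access at all
            }
            return readFromDisk(key);     // maybe stored: the read weeds out false positives
        }

        // Placeholders for the actual (slow) disk operations.
        private void writeToDisk(String key, String value) { }
        private String readFromDisk(String key) { return null; }
    }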

How does a cuckoo filter work?

A cuckoo filter is a compact variant of a cuckoo hash table that stores only fingerprints (a bit string derived from the item using a hash function) for each item inserted, instead of key-value pairs. The filter is densely filled with fingerprints (e.g., 95% of entries occupied), which confers high space efficiency.
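
A rough Java sketch of that mechanism, greatly simplified for illustration (single-entry buckets, 8-bit fingerprints, no semi-sorting, and an assumed power-of-two bucket count so the XOR step below is reversible):

    // Sketch of the partial-key cuckoo hashing behind cuckoo filters. Each item
    // is reduced to a small fingerprint with two candidate buckets; on insert,
    // an occupying fingerprint may be kicked to its alternate bucket.
    class TinyCuckooFilter {
        private final byte[] table;   // one fingerprint per bucket, 0 = empty
        private final int mask;

        TinyCuckooFilter(int numBuckets) {   // numBuckets must be a power of two
            table = new byte[numBuckets];
            mask = numBuckets - 1;
        }

        private static byte fingerprint(String item) {
            int fp = (item.hashCode() >>> 24) & 0xFF;
            return (byte) (fp == 0 ? 1 : fp);        // reserve 0 for "empty slot"
        }

        private int bucket1(String item) {
            return item.hashCode() & mask;
        }

        // The alternate bucket depends only on the current bucket and the
        // fingerprint, so entries can be relocated without the original item.
        private int altBucket(int bucket, byte fp) {
            return (bucket ^ ((fp & 0xFF) * 0x5bd1e995)) & mask;
        }

        boolean mightContain(String item) {
            byte fp = fingerprint(item);
            int i1 = bucket1(item);
            int i2 = altBucket(i1, fp);
            return table[i1] == fp || table[i2] == fp;   // fingerprint collisions = false positives
        }

        boolean insert(String item) {
            byte fp = fingerprint(item);
            int i1 = bucket1(item);
            int i2 = altBucket(i1, fp);
            if (table[i1] == 0) { table[i1] = fp; return true; }
            if (table[i2] == 0) { table[i2] = fp; return true; }
            int i = i2;                                  // start displacing from one bucket
            for (int kicks = 0; kicks < 64; kicks++) {
                byte evicted = table[i];
                table[i] = fp;
                fp = evicted;
                i = altBucket(i, fp);                    // evicted fingerprint's other bucket
                if (table[i] == 0) { table[i] = fp; return true; }
            }
            return false;                                // too many kicks: treat as "filter full"
        }
    }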


3 Answers

Bloom filters and Cuckoo filters are used in similar situations, but there are a lot of differences underneath that usually determine which is the better choice.

Bloom filters are used internally in database engines, notably Apache Cassandra. The reason is, as other posters have said, to reduce the cost of slow set operations. Basically, any "does this maybe or definitely not exist" operation with a high cost can use a Bloom filter to reduce the number of checks done.

Another common example in today's SaaS model is a remote REST service with a cost per call. Any API call with a binary answer, such as "is this address INVALID", can use a Bloom filter to eliminate over 90% of duplicate queries. Note that since Bloom and Cuckoo filters have false positives, they are NOT useful for the inverse operation, "is this address VALID".
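
As an example, with Guava's BloomFilter (the validation call and sizing below are hypothetical placeholders), duplicate lookups of addresses already confirmed invalid never reach the paid API:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    class AddressChecker {
        // Addresses we have already confirmed as invalid; sizes are illustrative.
        private final BloomFilter<CharSequence> knownInvalid = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,   // expected insertions
                0.01);       // target false positive probability

        boolean isInvalid(String address) {
            if (knownInvalid.mightContain(address)) {
                // "Maybe invalid": accept the small false positive rate, skip the paid call.
                return true;
            }
            boolean invalid = callRemoteValidationApi(address);  // the expensive per-call API
            if (invalid) {
                knownInvalid.put(address);   // remember, so duplicates never hit the API again
            }
            return invalid;
        }

        // Placeholder for the real cost-per-call REST request.
        private boolean callRemoteValidationApi(String address) { return false; }
    }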

It is important to remember that Bloom and Cuckoo filters have NO false negatives. This makes them useful for checks like "is this definitely not spam, or maybe spam", but not for operations where false positives are unacceptable, like checking user permissions. In this respect they can be considered the conceptual opposite of a cache: both Bloom/Cuckoo filters and caches are used primarily to reduce the cost of expensive operations with a boolean answer, except caches have no false positives while Bloom/Cuckoo filters have no false negatives.

Notable differences between Cuckoo/Bloom include:

  • Combination. Bloom filters can be merged efficiently, both quickly and with little bandwidth, as long as they are created with the same parameters. This is why you see them used frequently in massively distributed systems: exchanging Bloom filters is cheap (see the merge sketch after this list). Cuckoo filters are not easily composable, making them less useful in these circumstances.

  • False positive rate. Cuckoo filters are more space efficient. Many use cases for both structures are focused on low-level networking. On weak hardware, the ~40% higher space efficiency of Cuckoo filters at the same false positive rate can be important. The reference implementation, in C++, sorts items within each bucket for additional space savings, taking advantage of an item's position within a bucket to store smaller fingerprints. The additional libraries I'll mention later (including mine) don't seem to do this. If anyone ever uses my library I might add it :).

  • Constant false positive rate. Bloom filters have asymptotically worse false positive rates as they surpass their designed size. You can keep inserting items forever, but eventually your false positive rate will be nearly 100%. Cuckoo filters, being based on cuckoo hashing, have a set capacity beyond which insertions will actually fail. Repeated insertion of non-random item hashes can also cause Cuckoo filter insertions to fail, possibly far before the designed fill level.

  • Speed. This is subjective and depends a lot on the hardware, but Cuckoo filters are generally faster in the average case (in my experience). Most Bloom filter designs run a hash function twice. When using secure hash functions especially, this can be a big handicap compared to Cuckoo filters, which only hash inserted items once. The code I've seen uses various hashing functions for Bloom and Cuckoo filters: Google's Guava Bloom uses Murmur3, many other implementations use SHA1 or something else. If hash collisions can be exploited in your use case, make sure the library uses a secure hash. It is important to know that Bloom filters take roughly constant time to insert, while Cuckoo filters have a constant-time AVERAGE case: as a Cuckoo filter gets within a few percent of capacity, insert speeds slow down greatly. Even then, only the insert speed slows; all other operations remain constant average time.

  • Flexibility. Bloom filters only support insert and contains. Cuckoo filters additionally support deletion and limited counting. In the reference design, Cuckoo filters can determine how many times an item was inserted, up to 7 times. Bloom filters can only determine yes-no. Cuckoo filters also support deleting inserted items, a big positive in a lot of use cases compared to Bloom. When using Bloom filters, it is pretty standard to recreate the filter from scratch when it is "full" (estimated false positive rate exceeds threshold) since you can't delete old items. Note that rebuilding the filter still happens with Cuckoo filters when inserts start to fail, so depending on the use case this might be moot. In certain situations Cuckoo filters are more useful since you can delete items to stay within filter limits instead of rebuilding.

  • Support. Cuckoo filters are new, and stable libraries for many languages simply don't exist.
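
As an example of the merge property mentioned in the first point above, Guava Bloom filters built with identical parameters can simply be OR-ed together (a minimal sketch; the shard names and sizes are illustrative):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class MergeExample {
        public static void main(String[] args) {
            // Two filters with the same funnel, expected insertions and false
            // positive rate can be combined into one.
            BloomFilter<CharSequence> shardA = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);
            BloomFilter<CharSequence> shardB = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);

            shardA.put("alice@example.com");
            shardB.put("bob@example.com");

            if (shardA.isCompatible(shardB)) {
                shardA.putAll(shardB);  // shardA now answers "maybe" for items from either shard
            }

            System.out.println(shardA.mightContain("bob@example.com"));  // true
        }
    }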

The biggest advantage of Bloom filters is that they have more mature library support in most languages. The math behind Bloom filters is also better understood by scientists: most of the characteristics of Cuckoo filters have been determined empirically, whereas Bloom filters have a solid numerical basis. This can exclude Cuckoo filters from realtime and critical systems that must have verifiable performance guarantees, even though experimental evidence shows that Cuckoo filters perform better in most circumstances.

Shameless plug: I'm the developer of a Cuckoo filter library for Java, CuckooFilter4J. It is missing the bucket semi-sort used in the paper, so the space efficiency is somewhat lower than the reference implementation. In the project readme I have links to other implementations I'm aware of. Which structure is better depends on your use case, but mostly on whether a solid Cuckoo filter implementation exists for your language.

You should definitely take a look at the source before you use a Cuckoo/Bloom filter in production. I read through various libs before writing my own... many of them had silent size limits due to 32-bit underlying arrays, or obvious performance problems. Most had zero tests. Google's Guava Bloom implementation had by far the best code quality and tests (and supports 64-bit array limits). The only shortcomings of Guava's Bloom filter are that it doesn't have an option to use a secure hash function and that it isn't multi-threaded.

In a production system you might want multi-threading for speed. The answer for Guava's Bloom filter is to make a different filter for each thread and combine them occasionally. Since Cuckoo filters can't be combined, I added concurrent threading to my Cuckoo filter library. The other ones I'm aware of are either not thread safe or not concurrent.

Mark Gunlogson answered Oct 21 '22


Which do you prefer, wine or cheese?

A bloom filter is for when you have limited space, high query cost, and mostly negative queries.
In that case, a bloom filter with 8 bits per key and 4 hash functions gives you a 2.5% false positive rate; you process queries nearly 40 times faster than before, at the cost of 1 byte per key.
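
Those numbers match the standard approximation for a Bloom filter's false positive rate with k hash functions and m bits for n keys:

    p ≈ (1 - e^(-k·n/m))^k = (1 - e^(-4/8))^4 ≈ 0.024

which is roughly the 2.5% quoted above.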

On the other hand, if any of the previous conditions do not hold, a hash table acting as a cache makes sense, though it obviously will take a lot more than one byte per entry :-)

You can even skip over the hard edge cases of cuckoo hashing if it's a cache, since a cache can simply evict an entry when an insertion would otherwise fail. That also makes the size-increase problems of cuckoo hash tables (or anything other than linear hashing) moot.

Mischa answered Oct 21 '22


Cuckoo Filter.

"Cuckoo Filter: Practically Better than Bloom." Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher CoNext 2014. http://dx.doi.org/10.1145/2674005.2674994

From one of the authors' blog:

Let me describe a cuckoo filter and some of what's in the paper for you. If you want to avoid a technical discussion, all you need to know is that for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do). If you want to look at code, there's even a github repository for you with code for cuckoo filters.

alexei answered Oct 21 '22