I am trying to create a data structure for a fixed-size set that should support the following operations:
In my case, the size of the set is likely to be very small (4-16 elements), but lookups must be as fast as possible and read as few bits as possible. It also needs to be space-efficient. Replacements (i.e. operation 2) are likely to be rare. I looked into the following options:
Additional Information:
Any ideas on how to solve this?
A Cuckoo Filter is an option worth considering. To quote the paper's abstract:
Cuckoo Filter: Practically Better Than Bloom
We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
Well, note the following:
Using a standard hash table with a decent hash function (since the elements are numbers, there are plenty of standard hash functions to choose from) and 4|S| entries will require fewer than 2 probes per lookup on average (assuming unbiased numbers as input), though it might deteriorate to a terrible worst case of 4|S|. Of course you can bound it as follows:
- If the number of cells searched exceeds k, abort and return true (this causes a false positive with some probability you can calculate, and gives faster worst-case performance); see the sketch below.
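For concreteness, here is a minimal sketch of that bounded-probe lookup (C++), assuming 64-bit keys, a power-of-two table of roughly 4|S| slots, and an illustrative multiplicative hash; the sentinel value, constants, and names are my own, not part of the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear-probing table with a bounded lookup: after k probes we give up
// and report "present", trading a small false-positive probability for a
// hard bound on the work per query.
struct BoundedProbeSet {
    static constexpr uint64_t EMPTY = UINT64_MAX;  // assumes this key is never stored
    std::vector<uint64_t> slots;                   // ~4 * |S| slots, power of two
    int k;                                         // probe limit

    BoundedProbeSet(std::size_t capacity, int probe_limit)
        : slots(capacity, EMPTY), k(probe_limit) {}

    std::size_t index(uint64_t x) const {
        x *= 0x9E3779B97F4A7C15ULL;                // simple multiplicative mix
        return static_cast<std::size_t>(x & (slots.size() - 1));  // size must be a power of two
    }

    void insert(uint64_t x) {
        std::size_t i = index(x);
        while (slots[i] != EMPTY && slots[i] != x)
            i = (i + 1) & (slots.size() - 1);
        slots[i] = x;
    }

    // Returns false only if x is definitely absent within k probes;
    // exhausting the probe budget yields a (rare) false positive.
    bool maybe_contains(uint64_t x) const {
        std::size_t i = index(x);
        for (int probes = 0; probes < k; ++probes) {
            if (slots[i] == x)     return true;
            if (slots[i] == EMPTY) return false;
            i = (i + 1) & (slots.size() - 1);
        }
        return true;
    }
};
```

The point is that `maybe_contains` touches at most k cells, so the worst case is bounded; the only cost is a small, calculable FP probability when the probe budget runs out.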
Regarding counting Bloom filters - this is the way to go, IMO. Note that a standard Bloom filter for 16 elements (the upper end of your range) requires about 154 bits for a FP probability of 1%, or about 100 bits for a FP probability of 5%. (*)
So, since a counting Bloom filter needs roughly 4 times this number (4-bit counters instead of single bits), you get 616 bits / 400 bits. Note that on most modern machines this fits within a few CPU cache lines, which means (depending on the machine) reading all these bits could take fewer than 10 cycles.
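For illustration, here is a rough sketch of such a counting Bloom filter (C++), using 4-bit saturating counters stored one per byte for simplicity; the double-hashing scheme and constants are placeholders, not a specific recommended implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting Bloom filter: each of the m cells holds a small counter
// (conceptually 4 bits, capped at 15) instead of a single bit, so
// elements can be removed as well as added.
struct CountingBloomFilter {
    std::vector<uint8_t> counters;  // m counters
    int num_hashes;                 // number of hash functions (k)

    CountingBloomFilter(std::size_t m, int k) : counters(m, 0), num_hashes(k) {}

    // Derive the i-th cell index from two mixes of x (double hashing).
    std::size_t cell(uint64_t x, int i) const {
        uint64_t h1 = x * 0x9E3779B97F4A7C15ULL;
        uint64_t h2 = ((x ^ (x >> 33)) * 0xC2B2AE3D27D4EB4FULL) | 1;
        return static_cast<std::size_t>((h1 + static_cast<uint64_t>(i) * h2) % counters.size());
    }

    void add(uint64_t x) {
        for (int i = 0; i < num_hashes; ++i) {
            uint8_t &c = counters[cell(x, i)];
            if (c < 15) ++c;        // saturate so the counter cannot overflow
        }
    }

    void remove(uint64_t x) {       // only call for elements that were added
        for (int i = 0; i < num_hashes; ++i) {
            uint8_t &c = counters[cell(x, i)];
            if (c > 0) --c;
        }
    }

    bool maybe_contains(uint64_t x) const {
        for (int i = 0; i < num_hashes; ++i)
            if (counters[cell(x, i)] == 0) return false;
        return true;                // possibly present; false positives possible
    }
};
```

A packed implementation would store two 4-bit counters per byte to actually hit the 4x space figure; the byte-per-counter layout above is just for readability.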
IMO you cannot do much better than this without accepting a much higher FP rate.
(*) Calculated according to:
m = -n · ln(p) / (ln 2)², where m is the number of bits, n the number of elements, and p the target FP probability.
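Plugging in n = 16 (the upper end of the question's range, and an assumption on my part) reproduces the figures above; a quick check:

```cpp
#include <cmath>
#include <cstdio>

// m = -n * ln(p) / (ln 2)^2 : bits for a standard Bloom filter holding
// n elements at false-positive probability p.
int main() {
    const double n = 16.0;                 // assumed set size
    const double ps[] = {0.01, 0.05};
    for (double p : ps) {
        double m = -n * std::log(p) / (std::log(2.0) * std::log(2.0));
        std::printf("p = %.2f -> m ~= %.0f bits\n", p, std::ceil(m));
    }
    return 0;
}
```

This prints roughly 154 bits for p = 0.01 and 100 bits for p = 0.05, matching the numbers used earlier.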
P.S. If you can guarantee that each element is removed at most once, you can use a variation with double the space that has a slightly better FP rate but also some FNs, by simply using 2 Bloom filters: one for `regular` elements and one for `deleted` elements. An element is in the set if it is in `regular` and NOT in `deleted`.
This improves the FP rate at the expense of also having FNs.
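Here is a sketch of that two-filter variant (C++), built from two plain bit-only Bloom filters; as above, the hashing details are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain Bloom filter (bits only), used twice below.
struct BloomFilter {
    std::vector<bool> bits;
    int num_hashes;

    BloomFilter(std::size_t m, int k) : bits(m, false), num_hashes(k) {}

    std::size_t cell(uint64_t x, int i) const {
        uint64_t h1 = x * 0x9E3779B97F4A7C15ULL;
        uint64_t h2 = ((x ^ (x >> 33)) * 0xC2B2AE3D27D4EB4FULL) | 1;
        return static_cast<std::size_t>((h1 + static_cast<uint64_t>(i) * h2) % bits.size());
    }

    void add(uint64_t x) {
        for (int i = 0; i < num_hashes; ++i) bits[cell(x, i)] = true;
    }
    bool maybe_contains(uint64_t x) const {
        for (int i = 0; i < num_hashes; ++i)
            if (!bits[cell(x, i)]) return false;
        return true;
    }
};

// An element counts as present if it is (probably) in `regular` and
// (probably) NOT in `deleted`. Valid only if each element is removed at
// most once; a FP in `deleted` becomes a FN for the set as a whole.
struct TwoFilterSet {
    BloomFilter regular, deleted;

    TwoFilterSet(std::size_t m, int k) : regular(m, k), deleted(m, k) {}

    void insert(uint64_t x) { regular.add(x); }
    void remove(uint64_t x) { deleted.add(x); }
    bool contains(uint64_t x) const {
        return regular.maybe_contains(x) && !deleted.maybe_contains(x);
    }
};
```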