 

Fast space efficient data structure for set membership queries on small sets

I am trying to create a data structure for a fixed size set that should support the following operations:

  1. Query whether an element is in the set (false positives are ok, false negatives are not)
  2. Replace one element of the set with another element

In my case, the size of the set is likely to be very small (4-16 elements), but the lookups must be as fast as possible and read as few bits as possible. Also, it needs to be space efficient. Replacements (i.e. operation 2) are likely to be few. I looked into the following options:

  1. Bloom filters: This is the standard solution. However, deleting elements is difficult, which makes operation 2 hard to implement.
  2. Counting Bloom filters: The space requirement becomes much higher (~3-4x that of a standard Bloom filter) with no decrease in false positive rates.
  3. Simply storing a list of hashes of all the elements: Gives better false positive rates than a counting Bloom filter for similar space requirements, but is expensive to look up (in the worst case, every bit is read); a sketch of this appears after the list.
  4. The previous idea with perfect hashing for the location: I don't know of any fast perfect hash constructions for small sets of elements.
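For concreteness, here is a minimal sketch of option 3 for 64-bit elements: a fixed array of small fingerprints, scanned linearly on lookup. The 16-slot size, the 8-bit fingerprint width, the mixing constant, and all names are illustrative assumptions, not part of the question.

```cpp
#include <array>
#include <cstdint>

// Option 3 as a fixed array of 8-bit fingerprints (16 slots = 128 bits).
// Lookup may scan every slot in the worst case, which is the drawback
// mentioned above.
struct FingerprintSet {
    std::array<uint8_t, 16> fp{};  // 0 is reserved to mean "empty slot"

    static uint8_t fingerprint(uint64_t x) {
        x *= 0x9E3779B97F4A7C15ULL;            // cheap multiplicative mix
        uint8_t f = static_cast<uint8_t>(x >> 56);
        return f == 0 ? 1 : f;                 // never collide with "empty"
    }

    void add(uint64_t x) {                     // fill the first empty slot
        for (uint8_t &s : fp)
            if (s == 0) { s = fingerprint(x); return; }
    }

    bool maybeContains(uint64_t x) const {     // false positives possible,
        uint8_t f = fingerprint(x);            // false negatives are not
        for (uint8_t s : fp)
            if (s == f) return true;
        return false;
    }

    void replace(uint64_t oldX, uint64_t newX) {  // operation 2
        uint8_t f = fingerprint(oldX);
        for (uint8_t &s : fp)
            if (s == f) { s = fingerprint(newX); return; }
    }
};
```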

Additional Information:

  • The elements are 64 bit numbers.

Any ideas on how to solve this?

asked Feb 15 '23 by Subhasis Das


2 Answers

A cuckoo filter is an option that should be considered. To quote the paper's abstract:

Cuckoo Filter: Practically Better Than Bloom

We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.

https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
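A toy sketch of the idea, assuming 8 buckets of 4 slots and 8-bit fingerprints (all parameters and names are illustrative choices, not the paper's tuned implementation). The key trick is partial-key cuckoo hashing: the alternate bucket is computed from the current bucket and the fingerprint alone, which is what makes relocation and deletion possible.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>
#include <initializer_list>
#include <utility>

// Toy cuckoo filter: 8 buckets x 4 slots of 8-bit fingerprints.
struct CuckooFilter {
    static constexpr int kBuckets = 8, kSlots = 4;
    std::array<std::array<uint8_t, kSlots>, kBuckets> b{};  // 0 == empty

    static uint64_t mix(uint64_t x) {           // 64-bit finalizer-style hash
        x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
        return x;
    }
    static uint8_t fp(uint64_t x) {
        uint8_t f = static_cast<uint8_t>(mix(x) >> 56);
        return f ? f : 1;                       // 0 is reserved for "empty"
    }
    static int bucket1(uint64_t x) {
        return static_cast<int>(mix(x) & (kBuckets - 1));
    }
    static int altBucket(int i, uint8_t f) {    // involution: alt(alt(i)) == i
        return i ^ static_cast<int>(mix(f) & (kBuckets - 1));
    }

    bool insert(uint64_t x) {
        uint8_t f = fp(x);
        int i = bucket1(x);
        for (int kicks = 0; kicks < 64; ++kicks) {
            for (uint8_t &s : b[i])
                if (s == 0) { s = f; return true; }
            int victim = std::rand() % kSlots;  // evict a fingerprint and
            std::swap(f, b[i][victim]);         // try its alternate bucket
            i = altBucket(i, f);
        }
        return false;                           // filter is considered full
    }
    bool maybeContains(uint64_t x) const {
        uint8_t f = fp(x);
        int i = bucket1(x), j = altBucket(i, f);
        for (uint8_t s : b[i]) if (s == f) return true;
        for (uint8_t s : b[j]) if (s == f) return true;
        return false;
    }
    bool remove(uint64_t x) {                   // deletion works because the
        uint8_t f = fp(x);                      // alternate bucket depends
        int i = bucket1(x), j = altBucket(i, f);// only on (bucket, fingerprint)
        for (int idx : {i, j})
            for (uint8_t &s : b[idx])
                if (s == f) { s = 0; return true; }
        return false;
    }
};
```

Operation 2 from the question is then remove(oldX) followed by insert(newX). As with counting Bloom filters, removing an element that was never inserted can erase another element's matching fingerprint, so deletion is only safe for elements known to be present.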

answered May 19 '23 by Peter Rindal


Well, note the following:

  1. Using a standard hash table with a decent hash function (since the elements are numbers, there are plenty of standard hash functions to choose from) and 4|S| entries will require fewer than 2 lookups on average (assuming unbiased numbers as input), though it might deteriorate to a terrible worst case of 4|S|. Of course, you can bound it as follows:

    - If the number of cells searched exceeds k, abort and return true (this causes a false positive with some probability that you can calculate, and gives faster worst-case performance). A sketch of this appears after the list.

  2. Regarding counting Bloom filters - this is the way to do it, IMO. Note that a standard Bloom filter requires 154 bits for a false positive probability of 1%, or 100 bits for a false positive probability of 5%. (*)
    So, if you need 4 times this number, you get 616 bits / 400 bits. Note that on most modern machines this is small enough to fit in a few CPU cache lines, which means (depending on the machine) that reading all these bits could take fewer than 10 cycles.
    IMO you cannot beat it without accepting a much higher false positive rate.
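Here is a rough sketch of item 1's bounded probing, assuming |S| = 16, 4|S| = 64 slots, and that 0 is never a stored element (it serves as the empty marker); the names and the probe bound k = 4 are illustrative.

```cpp
#include <array>
#include <cstdint>

// Item 1's idea: open addressing at 25% load with a capped probe sequence.
struct ProbedSet {
    static constexpr int kCap = 64;            // 4|S| slots for |S| = 16
    static constexpr int kMaxProbes = 4;       // the abort bound "k"
    std::array<uint64_t, kCap> slot{};         // 0 == empty (assumed unused)

    static int home(uint64_t x) {
        x *= 0x9E3779B97F4A7C15ULL;
        return static_cast<int>(x >> 58);      // top 6 bits -> 0..63
    }

    void insert(uint64_t x) {                  // linear probing
        for (int i = home(x), n = 0; n < kCap; ++n, i = (i + 1) & (kCap - 1))
            if (slot[i] == 0) { slot[i] = x; return; }
    }

    // Exact answers until the probe budget runs out; after kMaxProbes we
    // answer "true", trading the bad worst case for a rare false positive.
    bool maybeContains(uint64_t x) const {
        int i = home(x);
        for (int n = 0; n < kMaxProbes; ++n, i = (i + 1) & (kCap - 1)) {
            if (slot[i] == x) return true;
            if (slot[i] == 0) return false;
        }
        return true;                           // probe cap exceeded: "maybe"
    }
};
```

Deletion (needed for operation 2) additionally requires tombstones or re-inserting the displaced cluster, which the sketch omits.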

(*) Calculated according to the optimal Bloom filter size formula, with n = 16 elements:

m = -n ln(p) / (ln 2)^2
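A quick check of the two figures quoted above, plugging n = 16 into the formula (a hypothetical throwaway program):

```cpp
#include <cmath>
#include <cstdio>

// Plugging n = 16 into m = -n ln(p) / (ln 2)^2 reproduces the numbers
// above: 154 bits at p = 0.01 and 100 bits at p = 0.05.
int main() {
    const double n = 16.0;
    for (double p : {0.01, 0.05}) {
        double m = -n * std::log(p) / (std::log(2.0) * std::log(2.0));
        std::printf("p = %.2f  ->  m = %.0f bits\n", p, std::ceil(m));
    }
    return 0;
}
```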

P.S. If you can guarantee that each element is removed at most once, you can use a variation with double the space that has a slightly better false positive rate but also some false negatives: simply use 2 Bloom filters, one for inserted elements and one for deleted elements. An element is in the set if it is in the "inserted" filter and NOT in the "deleted" filter.
This improves the false positive rate at the expense of introducing false negatives (a false positive in the "deleted" filter hides an element that is actually present). A sketch appears below.
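A minimal sketch of this two-filter scheme; the 256-bit size per filter, k = 4 hash functions, and the seeded mixing hash are all illustrative choices.

```cpp
#include <bitset>
#include <cstdint>

// The P.S. scheme: one Bloom filter for inserted elements, one for
// deleted elements. Membership = in "added" AND NOT in "deleted".
struct DeletableBloom {
    static constexpr int kBits = 256;          // per filter, illustrative
    static constexpr uint64_t kHashes = 4;
    std::bitset<kBits> added, deleted;

    static uint64_t mix(uint64_t x, uint64_t seed) {
        x ^= seed;
        x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
        return x;
    }

    void insert(uint64_t x) {
        for (uint64_t s = 1; s <= kHashes; ++s)
            added.set(mix(x, s) % kBits);
    }
    void remove(uint64_t x) {                  // each element at most once!
        for (uint64_t s = 1; s <= kHashes; ++s)
            deleted.set(mix(x, s) % kBits);
    }
    bool maybeContains(uint64_t x) const {
        for (uint64_t s = 1; s <= kHashes; ++s)
            if (!added.test(mix(x, s) % kBits)) return false;    // not added
        for (uint64_t s = 1; s <= kHashes; ++s)
            if (!deleted.test(mix(x, s) % kBits)) return true;   // not deleted
        return false;  // all "deleted" bits set: treated as removed (possible FN)
    }
};
```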

answered May 19 '23 by amit