 

Fast approximate algorithm for cardinality of set intersection

I have a pool of sets (pool size n), and the sets do not all fit in RAM. I can fit only a small fraction, say 1-5% of all sets, into RAM.

The problem: given a query set Q, I need to return the top k sets with the largest cardinality of intersection with Q.

  1. Assuming Q is from the same pool of sets.
  2. For general Q.

k is small, in the hundreds, while n is in the hundreds of millions. The total number of distinct elements across all sets is in the hundreds of millions, too.

  • There are many probabilistic data structures (KMV, MinHash and its variants); which one should I use?
  • Can I modify HyperLogLog for my task?
  • Which of those structures can be assembled into some kind of index?

I did some experiments representing sets as Bloom filters. Because set sizes vary greatly, I have to use very large Bloom filters, which is inefficient (the Bloom filters take 5x the space of the original dataset). Adaptive Bloom filters from https://github.com/jaybaird/python-bloomfilter produce only 3-5x compression of the dataset, so this is still pretty much infeasible.
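Concretely, the experiment boils down to roughly the sketch below: one filter per stored set, with the intersection size against Q estimated by counting membership hits. The class and constructor names are how I understand that library's API, so treat them as an assumption.

    # One Bloom filter per stored set; |S ∩ Q| estimated by membership tests.
    # BloomFilter is from https://github.com/jaybaird/python-bloomfilter (pybloom).
    from pybloom import BloomFilter

    def build_filter(s, error_rate=0.01):
        # Filter sized to the set, hence very large filters for very large sets.
        bf = BloomFilter(capacity=len(s), error_rate=error_rate)
        for x in s:
            bf.add(x)
        return bf

    def estimate_intersection(bf, q):
        # Count elements of Q that the filter claims to contain;
        # false positives overestimate by roughly error_rate * |Q|.
        return sum(1 for x in q if x in bf)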

asked Jun 04 '16 by Moonwalker


1 Answer

The K-Minimum Values (KMV) data structure is extremely memory efficient. Unlike Bloom filters, it does not provide a membership test, only set-theoretic operations and cardinality estimates.

It might work for you, depending on the cardinalities of your sets and your error tolerance.
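To make the composition concrete, here is a minimal KMV sketch in Python; the choice of k = 1024 and the SHA-1 based hash are illustrative assumptions, not from any particular library.

    # Minimal KMV sketch: keep the k smallest hash values per set.
    import hashlib
    import heapq

    def h(x):
        # Hash an element to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha1(str(x).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    class KMV:
        def __init__(self, k=1024):
            self.k = k
            self.values = set()  # the k smallest hash values seen so far

        def add(self, x):
            v = h(x)
            if v in self.values:
                return
            if len(self.values) < self.k:
                self.values.add(v)
            elif v < max(self.values):
                self.values.remove(max(self.values))
                self.values.add(v)

        def cardinality(self):
            if len(self.values) < self.k:
                return float(len(self.values))      # sketch holds the whole set
            return (self.k - 1) / max(self.values)  # standard KMV estimator

    def intersection_cardinality(a, b):
        # Estimate |A ∩ B| as Jaccard(A, B) * |A ∪ B|, both read off the sketches.
        k = min(a.k, b.k)
        combined = heapq.nsmallest(k, a.values | b.values)  # KMV sketch of A ∪ B
        if not combined:
            return 0.0
        union_est = (k - 1) / combined[-1] if len(combined) == k else float(len(combined))
        jaccard = len(set(combined) & a.values & b.values) / len(combined)
        return jaccard * union_est

Each sketch stores only k hash values (a few kilobytes) regardless of set size, which is where the memory efficiency comes from; the estimation error shrinks as k grows.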

answered Nov 11 '22 by Vlad Shcherbina