 

Fast approximate algorithm for cardinality of set intersection

I have a pool of sets (pool size n), and the sets do not all fit in RAM. I can fit only a small fraction, say 1-5% of all sets, into RAM.

The problem: given a query set Q, I need to return the top k sets with the largest cardinality of intersection with Q.

  1. Assuming Q is from the same pool of sets.
  2. For general Q.

k is small, in the hundreds, while n is in the hundreds of millions. The total number of distinct elements across all sets is in the hundreds of millions, too.

  • There are many probabilistic data structures (KMV, MinHash and its variants); which one should I use?
  • Can I modify HyperLogLog for my task?
  • Which of those structures can be assembled into some kind of index?

I did some experiments representing sets as Bloom filters. Because set sizes vary greatly, I have to use very large Bloom filters, which is inefficient (the Bloom filters take 5x the space of the original dataset). Adaptive Bloom filters from https://github.com/jaybaird/python-bloomfilter produce only 3-5x compression of the dataset, so this is still pretty much infeasible.
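Concretely, the experiment boils down to roughly the sketch below: one filter per stored set, with the intersection size against Q estimated by counting membership hits. The class and constructor names are how I understand that library's API, so treat them as an assumption.

    # One Bloom filter per stored set; |S ∩ Q| estimated by membership tests.
    # BloomFilter is from https://github.com/jaybaird/python-bloomfilter (pybloom).
    from pybloom import BloomFilter

    def build_filter(s, error_rate=0.01):
        # Filter sized to the set, hence very large filters for very large sets.
        bf = BloomFilter(capacity=len(s), error_rate=error_rate)
        for x in s:
            bf.add(x)
        return bf

    def estimate_intersection(bf, q):
        # Count elements of Q that the filter claims to contain;
        # false positives overestimate by roughly error_rate * |Q|.
        return sum(1 for x in q if x in bf)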

asked Jun 04 '16 by Moonwalker


1 Answer

The K-Minimum Values (KMV) data structure is extremely memory efficient. Unlike Bloom filters, it does not provide a membership test, only set-theoretic operations and cardinality estimates.

It might work for you, depending on the cardinalities of your sets and your error tolerance.
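To make the composition concrete, here is a minimal KMV sketch in Python; the choice of k = 1024 and the SHA-1 based hash are illustrative assumptions, not from any particular library.

    # Minimal KMV sketch: keep the k smallest hash values per set.
    import hashlib
    import heapq

    def h(x):
        # Hash an element to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha1(str(x).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    class KMV:
        def __init__(self, k=1024):
            self.k = k
            self.values = set()  # the k smallest hash values seen so far

        def add(self, x):
            v = h(x)
            if v in self.values:
                return
            if len(self.values) < self.k:
                self.values.add(v)
            elif v < max(self.values):
                self.values.remove(max(self.values))
                self.values.add(v)

        def cardinality(self):
            if len(self.values) < self.k:
                return float(len(self.values))      # sketch holds the whole set
            return (self.k - 1) / max(self.values)  # standard KMV estimator

    def intersection_cardinality(a, b):
        # Estimate |A ∩ B| as Jaccard(A, B) * |A ∪ B|, both read off the sketches.
        k = min(a.k, b.k)
        combined = heapq.nsmallest(k, a.values | b.values)  # KMV sketch of A ∪ B
        if not combined:
            return 0.0
        union_est = (k - 1) / combined[-1] if len(combined) == k else float(len(combined))
        jaccard = len(set(combined) & a.values & b.values) / len(combined)
        return jaccard * union_est

Each sketch stores only k hash values (a few kilobytes) regardless of set size, which is where the memory efficiency comes from; the estimation error shrinks as k grows.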

answered Nov 11 '22 by Vlad Shcherbina