I have a pool of n sets that does not fit in RAM; I can fit only a small fraction, say 1–5%, of the sets into memory at once.
The problem: given a query set Q, I need to return the top-k sets with the largest intersection cardinality with Q.
k is small, in the hundreds, while n is in the hundreds of millions. The total number of distinct elements across all sets is in the hundreds of millions, too.
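For concreteness, this is the exact query I want to approximate. A brute-force sketch (the in-memory `pool` dict is for illustration only; the real pool does not fit in RAM):

```python
import heapq

def top_k_by_intersection(pool, Q, k):
    """Score every set by |S & Q| and keep the k best (score, name) pairs.

    Infeasible at n in the hundreds of millions, but it defines the
    target output that any approximate scheme must reproduce.
    """
    scored = ((len(s & Q), name) for name, s in pool.items())
    return heapq.nlargest(k, scored)

pool = {"a": {1, 2, 3}, "b": {2, 3, 4, 5}, "c": {9}}
top_k_by_intersection(pool, {2, 3, 9}, 2)  # -> [(2, 'b'), (2, 'a')]
```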
I did some experiments representing the sets as Bloom filters. Because set sizes vary greatly, I have to size every filter for the largest sets, which is inefficient (the Bloom filters take 5x the space of the original dataset). Scalable Bloom filters from https://github.com/jaybaird/python-bloomfilter still take 3–5x the space of the original dataset, so this is still pretty much infeasible.
The K-Minimum Values (KMV) sketch is extremely memory efficient. Unlike a Bloom filter, it does not provide a membership test, only set-theoretic operations and cardinality estimates.
It might work for you, depending on the cardinalities of your sets and your error tolerance.
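A minimal KMV sketch, assuming SHA-1 as the hash and the standard (k-1)/v_k cardinality estimator; the class name and parameters are mine, not from a particular library:

```python
import hashlib

class KMV:
    """K-Minimum Values sketch: keep the k smallest hash values seen.

    Memory is O(k) per set regardless of set size; with k = 256 the
    relative error of the cardinality estimate is roughly 1/sqrt(k) ~ 6%.
    """

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # the k smallest hashes, as floats in [0, 1)

    @staticmethod
    def _hash(item):
        # Deterministic hash mapped to [0, 1).
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def cardinality(self):
        if len(self.values) < self.k:
            return len(self.values)  # sketch not full: count is exact
        # Standard KMV estimator: (k - 1) / (k-th smallest hash value).
        return int((self.k - 1) / max(self.values))

    def union(self, other):
        # Union sketch = the k smallest values of the combined sketches.
        u = KMV(self.k)
        u.values = set(sorted(self.values | other.values)[: self.k])
        return u

    def intersection_cardinality(self, other):
        # Jaccard estimate: fraction of the union sketch's values that
        # appear in both input sketches, scaled by the union cardinality.
        u = self.union(other)
        if not u.values:
            return 0
        shared = u.values & self.values & other.values
        return int(len(shared) / len(u.values) * u.cardinality())

a, b = KMV(256), KMV(256)
for i in range(10_000):
    a.add(i)
for i in range(5_000, 15_000):
    b.add(i)
# True intersection size is 5,000; the estimate lands in that neighborhood.
```

At 256 eight-byte values a sketch is about 2 KB per set independent of set size, so the aggregate is far smaller than Bloom filters sized for the largest sets, though with n in the hundreds of millions the total may still be substantial.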