Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any HyperLogLog like structure for multiple multisets?

HyperLogLog estimates cardinality of a multiset. Is it possible to extend it to handle multiple multisets? Like, instead of just supporting query estimateCardinality(), it would support estimateCardinality(multiset_id). I'm trying to avoid having a dictionary of HyperLogLog values for each multiset_id.

Is there another way (data structure) to achieve this?

like image 973
kafka Avatar asked Nov 01 '22 01:11

kafka


1 Answers

The following idea might help out when you have a largish number of multiesets with high-variance in their cardinalities; that is, some have large size, and some have small size. It does not require you to estimate in advance which will be small and which will be large.

You can build a Linear Probabilistic Counter, with a small change. The original data structure has a (logical) boolean in each position. Here, each position would itself be a classial set. Instead of setting a bit on an

insert(element) 

op if it falls in this position, you would insert id into the set on an

insert(element, id)

There are some common-sense tricks you ould do to conserve space. E.g., you could decide that if id appears in over a certain fraction of the bins, then it is not stored in bin sets, but rather in a separate bitmap over all bins.

Overall, then, if you have both small and large sets, you would end up with the following:

  • a bitmap for each large set (this is the same cost per item for your dictionary of counters idea)

  • entries in some of the bits' sets for each small set (possibly much smaller than your dictionary of counters idea)

Since the data structure can switch, for a specific multiset, from the latter to the former - it might save space relative to the dictionary of counters idea, which might be considered a premature pessimization.

YMMV.

like image 134
Ami Tavory Avatar answered Nov 11 '22 17:11

Ami Tavory