Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.

Two Strategies

There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (not big in C, myself) figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and to some degree errors (stdev +-2%) in the pursuit of efficiency with extremely large sets.

1. Inclusion Principle

By far the simplest of the two--essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40% which is unacceptable in this use case).

Basically:

aCardinality + bCardinality - intersectionCardinality

or, in the case of multiple sets...

aCardinality + (bCardinality x cCardinality) - intersectionCardinality

seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of high disparity in size...

2. Jaccard Index Intersection/MinHash

This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (ie, I'm not implementing my own HLL algorithm from scratch).

With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using minhash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.


Problem is, I'm not very well versed in HLL's and while I'd love to dig into the Google paper I need a viable implementation in short order. Chances are I'm overlooking some basic considerations either of Redis' existing optimizations, or else in the algorithm itself that allows for computationally-cheap intersection estimates with pretty lax confidence bounds.

thus, my question:

How do I most effectively get a computationally-cheap intersection estimate of N huge (billions) sets, using redis, if I'm willing to sacrifice space (and to a small degree, accuracy)?

like image 528
Julian Avatar asked May 07 '15 16:05

Julian


People also ask

When should I use HyperLogLog?

A HyperLogLog is a probabilistic data structure used to count unique values — or as it's referred to in mathematics: calculating the cardinality of a set. These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.

Why use HyperLogLog?

HyperLogLog in Real-world ApplicationThis algorithm can estimate the number of unique values within a very large dataset using little memory and time. In fact, it can estimate cardinalities beyond 10⁹ with a 2% standard error while only using 1.5kb memory.

What is HyperLogLog in Redis?

Redis HyperLogLog is an algorithm that uses randomization in order to provide an approximation of the number of unique elements in a set using just a constant, and small amount of memory.


2 Answers

Read this paper some time back. Will probably answer most of your questions. Inclusion Principle inevitably compounds error margins a large number of sets. Min-Hash approach would be the way to go.

http://tech.adroll.com/media/hllminhash.pdf

like image 159
frugalcoder Avatar answered Nov 02 '22 00:11

frugalcoder


There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.

For more details see the paper available at http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

like image 41
otmar Avatar answered Nov 02 '22 00:11

otmar