Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog intersections based on Redis' representation thereof. This includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
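On the sparse/dense point, here is a minimal sketch of telling the two encodings apart once a HyperLogLog value has been exported as raw bytes (for example via `GET`, since Redis stores an HLL as a string). It assumes the 16-byte header layout used in Redis' `hyperloglog.c` (4-byte `HYLL` magic, 1 encoding byte, 3 unused bytes, 8 bytes of cached cardinality); the function name `hll_encoding` is hypothetical, and the layout should be checked against the Redis version in use.

```python
HLL_HEADER_SIZE = 16  # assumed: 4 magic + 1 encoding + 3 unused + 8 cached cardinality


def hll_encoding(raw: bytes) -> str:
    """Return 'dense' or 'sparse' for a raw Redis HyperLogLog value.

    Assumes the header layout described above; raises ValueError otherwise.
    """
    if len(raw) < HLL_HEADER_SIZE or raw[:4] != b"HYLL":
        raise ValueError("not a Redis HyperLogLog value")
    encoding = raw[4]
    if encoding == 0:  # assumed to match HLL_DENSE in the Redis source
        return "dense"
    if encoding == 1:  # assumed to match HLL_SPARSE in the Redis source
        return "sparse"
    raise ValueError(f"unknown encoding byte {encoding}")
```

Anything that reads registers outside Redis would need to branch on this value, because the sparse form is run-length encoded while the dense form packs fixed-width registers.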

Two Strategies

There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and, not being much of a C programmer, I'm having a bit of trouble figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and, to some degree, accuracy (stdev ±2%) in the pursuit of efficiency with extremely large sets.

1. Inclusion-Exclusion Principle

By far the simplest of the two: essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40%, which is unacceptable in this use case).

Basically, the intersection of two sets is estimated as:

aCardinality + bCardinality - unionCardinality

or, in the case of three sets...

aCardinality + bCardinality + cCardinality - abUnionCardinality - acUnionCardinality - bcUnionCardinality + abcUnionCardinality

seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of high disparity in size...
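For reference, a minimal sketch of the inclusion-exclusion estimate generalized to N sets. It alternates signs over the union cardinality of every non-empty subset of keys. The `count_union` callable is an assumption: against Redis it could be something like `lambda keys: r.pfcount(*keys)` with a redis-py client `r`, since `PFCOUNT` with several keys returns the cardinality of their union without needing a separate `PFMERGE`.

```python
from itertools import combinations
from typing import Callable, Sequence


def inclusion_exclusion_intersection(
    keys: Sequence[str],
    count_union: Callable[[Sequence[str]], float],
) -> float:
    """Estimate |K1 ∩ ... ∩ Kn| from union cardinalities only.

    Odd-sized subsets contribute their union size with a plus sign,
    even-sized subsets with a minus sign.
    """
    total = 0.0
    for size in range(1, len(keys) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(keys, size):
            total += sign * count_union(subset)
    return total
```

Note that this issues 2^N - 1 union counts, each carrying its own estimation error, so it is only practical (and only somewhat trustworthy) for a handful of sets.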

2. Jaccard Index Intersection/MinHash

This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (i.e., I'm not implementing my own HLL algorithm from scratch).

With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using MinHash to estimate the Jaccard index of the sets, you can then multiply the union cardinality by that index for a more accurate intersection count.
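A minimal sketch of that idea, using the bottom-k variant of MinHash (keep the k smallest hash values per set) rather than k independent hash functions. All names here are hypothetical, `blake2b` is just one convenient stable hash, and the union cardinality is assumed to come from elsewhere, for example a multi-key `PFCOUNT`.

```python
import hashlib
import heapq
from typing import Iterable, Set


def h64(item: str) -> int:
    """Stable 64-bit hash of an item."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")


def bottom_k(items: Iterable[str], k: int) -> Set[int]:
    """Sketch of a set: its k smallest distinct hash values."""
    return set(heapq.nsmallest(k, {h64(x) for x in items}))


def jaccard_estimate(sketch_a: Set[int], sketch_b: Set[int], k: int) -> float:
    """Fraction of the k smallest hashes of the union that occur in both sketches."""
    union_smallest = heapq.nsmallest(k, sketch_a | sketch_b)
    if not union_smallest:
        return 0.0
    shared = sum(1 for v in union_smallest if v in sketch_a and v in sketch_b)
    return shared / len(union_smallest)


def minhash_intersection(jaccard: float, union_cardinality: float) -> float:
    """Intersection estimate: Jaccard index times the (HLL-estimated) union size."""
    return jaccard * union_cardinality
```

The error of the Jaccard estimate shrinks roughly as 1/sqrt(k), so k trades sketch size against accuracy independently of the HLL register count.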

Problem is, I'm not very well versed in HLLs, and while I'd love to dig into the Google paper, I need a viable implementation in short order. Chances are I'm overlooking some basic considerations, either in Redis' existing optimizations or in the algorithm itself, that allow for computationally cheap intersection estimates with pretty lax confidence bounds.

Thus, my question:

How do I most effectively get a computationally cheap intersection estimate of N huge (billions) sets, using Redis, if I'm willing to sacrifice space (and, to a small degree, accuracy)?

asked May 07 '15 16:05

Julian

2 Answers

I read this paper some time back; it will probably answer most of your questions. The inclusion-exclusion principle inevitably compounds error margins across a large number of sets. The MinHash approach would be the way to go.

http://tech.adroll.com/media/hllminhash.pdf
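To make the compounding concrete, here is a rough back-of-the-envelope sketch for two sets. It assumes Redis' 16384 registers, giving the usual HLL standard error of 1.04/sqrt(m) (about 0.81%), and simply adds the one-sigma absolute errors of the three terms |A|, |B| and |A ∪ B|. That is a pessimistic simplification rather than a proper error model, and the function name is hypothetical.

```python
import math

REGISTERS = 16384                      # register count used by Redis HLLs
STD_ERR = 1.04 / math.sqrt(REGISTERS)  # about 0.0081, i.e. 0.81%


def rough_relative_error(card_a: float, card_b: float, true_intersection: float,
                         rel_err: float = STD_ERR) -> float:
    """Rough relative error of |A| + |B| - |A ∪ B| as an intersection estimate.

    Each of the three terms is off by about rel_err times its own size, so the
    absolute error scales with the large sets, not with the small intersection.
    """
    union = card_a + card_b - true_intersection
    absolute_error = rel_err * (card_a + card_b + union)
    return absolute_error / true_intersection


# Two sets of similar size with a large overlap: error stays in the single digits (%).
similar = rough_relative_error(1e6, 1e6, 5e5)
# A billion-element set against a million-element set: the error dwarfs the answer.
disparate = rough_relative_error(1e9, 1e6, 5e5)
```

With more than two sets the number of terms grows as 2^N - 1, so the same effect only gets worse.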

answered Nov 02 '22 00:11

frugalcoder

There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.

For more details see the paper available at http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

answered Nov 02 '22 00:11

otmar

Tags: hashtable, redis, hash, hyperloglog, minhash