Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Applying HyperLogLog to a sample of the population

The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the cardinality of a set using only a tiny amount of memory. However, it does take into account all N elements of the original set in the calculation. What if we had access to only a small random sample (say, 10%) of the original N? Has there been any research on how HyperLogLog or similar algorithms can be adapted to this situation?

I am aware that this is essentially the problem described as distinct value estimation, for which abundant research exists (see for example this paper for an overview). However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog. Therefore, I am wondering if someone has already thought of adapting HyperLogLog to the distinct value estimation problem.

like image 356
Jon Smark Avatar asked Nov 25 '12 16:11

Jon Smark


People also ask

When should I use HyperLogLog?

A HyperLogLog is a probabilistic data structure used to count unique values — or as it's referred to in mathematics: calculating the cardinality of a set. These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.

How does HyperLogLog work?

In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

How do you determine the cardinality of a snowflake?

The following Aggregate Functions are provided for estimating cardinality using HyperLogLog: HLL: Returns an approximation of the distinct cardinality of the input. HLL_ACCUMULATE: Skips the final estimation step and returns the HyperLogLog state at the end of an aggregation.


2 Answers

However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog.

Yes, because they are solving a very different problem.

Suppose you just confiscated a stash of 1.000.000 counterfeit dollar bills, and you want to know the number of distinct serial numbers.

Sampling 100.000 of them (using HyperLogLog, as your antique steam-driven counting machine has only 1k memory) you count 5000 different serial numbers, each of which occurs somewhere around 20 times. Then you can be pretty sure that the whole stash will contain only a little over 5000 distinct serial numbers.

Now suppose that 1 serial number occurs 95.001 times, and 4999 serial numbers occur only once. Apparently some bona fide bank notes found their way into your stash. Now you can be pretty confident that the stash contains around 5% honest banknotes, so that the entire stash contains around 50.000 distinct serial numbers

Note that the distribution of the frequencies in your sample is used to infer something about the distribution in the entire stash. This is actually mentioned as one of the "ad hoc" (your words) methods in the second paper you cite ("Sampling-based estimation of the number of distinct values(..)"):

The idea behind a parametric estimator is to fit a probability distribution to the observed relative frequencies of the different attribute values.

Also note that the results of HyperLogLog and similar methods are completely insensitive to the distribution of the samples over their values. But your final estimate evidently depends very much on it!

My advice: use method of your choice (like HyperLogLog) to count the number of distinct values in your sample, and then use one of the methods in "Sampling-based estimation" to estimate the number of values in your entire multiset , or use your prior knowledge abut the distribution of the multiset to calculate an estimate (maybe you saw the counterfeiters' printing press, and you know it could only ever print one serial number)

like image 119
Hans Lub Avatar answered Sep 29 '22 13:09

Hans Lub


Citation search is a wonderful thing. I'm not super familiar with the two problems as posed, so this paper might not be exactly what you meant. At the least they certainly talk about HyperLogLog and its relationship to the problem, so maybe it will sate your curiosity.

An Optimal Algorithm for the Distinct Elements Problem

like image 45
FoolishSeth Avatar answered Sep 29 '22 11:09

FoolishSeth