
Incremental way of computing quantiles for a large set of data

I need to compute quantiles for a large set of data.

Let's assume we can get the data only in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile one needs to collect all the portions of the data, store them somewhere, sort the whole set, and then read off the quantile:

List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach (var row in matrix)
{
    allData.AddRange(row);
}

allData.Sort();
// Nearest-rank method: Q3 is the element whose rank is ceil(0.75 * N)
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

I would like to find a way of obtaining the quantile without storing all the data in an intermediate variable. The best solution would be to compute some summary parameters from the first row and then adjust them step by step as each subsequent row arrives.

Note:

  • These datasets are really big (ca. 5000 elements in each row).
  • Q3 can be estimated; it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the variation is small (± a few hundred samples), but it is there.

This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but here I need to compute quantiles.

There are also a few articles on this topic, e.g.:

  • An Efficient Algorithm for the Approximate Median Selection Problem
  • Incremental quantile estimation for massive tracking

Before trying to implement these approaches, I wondered whether there are maybe any other, quicker ways of computing the 0.25/0.75 quantiles?

Gacek asked May 14 '10




2 Answers

I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
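The bucketing idea above can be sketched roughly as follows (the class and names here are mine, not from the answer; the `[lo, hi)` range would come from the 10%/90% quantiles of an initial random sample, and out-of-range values are just counted so the cumulative walk stays correct):

```java
// Illustrative sketch of an equal-width histogram quantile estimator.
class BucketQuantileEstimator {
    final double lo, hi, width;
    final long[] counts;
    long below, above, total;   // values falling outside [lo, hi) are only counted

    BucketQuantileEstimator(double lo, double hi, int buckets) {
        this.lo = lo;
        this.hi = hi;
        this.width = (hi - lo) / buckets;
        this.counts = new long[buckets];
    }

    void add(double x) {
        total++;
        if (x < lo) below++;
        else if (x >= hi) above++;
        else counts[(int) ((x - lo) / width)]++;
    }

    // Walk the histogram until the cumulative count reaches q*total,
    // then interpolate linearly inside that bucket.
    double quantile(double q) {
        double target = q * total;
        double cum = below;
        if (cum >= target) return lo;       // quantile lies below the bucketed range
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0 && cum + counts[i] >= target)
                return lo + (i + (target - cum) / counts[i]) * width;
            cum += counts[i];
        }
        return hi;                          // quantile lies above the bucketed range
    }
}
```

Memory is fixed by the bucket count regardless of how many rows stream through, which is exactly the property the question asks for; if `below` or `above` grows large, that's the signal to restart with a wider range.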

Keith Randall answered Nov 06 '22


There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.

The basic idea is that smaller bins are used at the extremes, in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest variant requires no dynamic allocation: once the MergingDigest is instantiated, no further heap allocation is required.

See https://github.com/tdunning/t-digest

Ted Dunning answered Nov 06 '22