
Incremental way of computing quantiles for a large set of data

I need to compute quantiles for a large set of data.

Let's assume we can get the data only in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile one needs to collect all the portions of the data, store them somewhere, sort the whole set, and then read off the quantile:

List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach (var row in matrix)
{
    allData.AddRange(row);
}

allData.Sort();
// Nearest-rank method: Q3 is the element whose rank is ceil(0.75 * N)
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

I would like to find a way of obtaining the quantile without storing all the data in an intermediate variable. The best solution would be to compute some summary parameters from the first row and then adjust them step by step as each subsequent row arrives.

Note:

  • These datasets are really big (ca. 5000 elements in each row).
  • Q3 can be estimated; it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the variation is small (± a few hundred samples), but it is there.

This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but here I need to compute quantiles.

There are also a few articles on this topic, e.g.:

  • An Efficient Algorithm for the Approximate Median Selection Problem
  • Incremental quantile estimation for massive tracking

Before trying to implement these approaches, I wondered whether there are maybe any other, quicker ways of computing the 0.25/0.75 quantiles?

Gacek asked May 14 '10




2 Answers

I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
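The bucketing idea above can be sketched roughly as follows (the class and names here are mine, not from the answer; the `[lo, hi)` range would come from the 10%/90% quantiles of an initial random sample, and out-of-range values are just counted so the cumulative walk stays correct):

```java
// Illustrative sketch of an equal-width histogram quantile estimator.
class BucketQuantileEstimator {
    final double lo, hi, width;
    final long[] counts;
    long below, above, total;   // values falling outside [lo, hi) are only counted

    BucketQuantileEstimator(double lo, double hi, int buckets) {
        this.lo = lo;
        this.hi = hi;
        this.width = (hi - lo) / buckets;
        this.counts = new long[buckets];
    }

    void add(double x) {
        total++;
        if (x < lo) below++;
        else if (x >= hi) above++;
        else counts[(int) ((x - lo) / width)]++;
    }

    // Walk the histogram until the cumulative count reaches q*total,
    // then interpolate linearly inside that bucket.
    double quantile(double q) {
        double target = q * total;
        double cum = below;
        if (cum >= target) return lo;       // quantile lies below the bucketed range
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0 && cum + counts[i] >= target)
                return lo + (i + (target - cum) / counts[i]) * width;
            cum += counts[i];
        }
        return hi;                          // quantile lies above the bucketed range
    }
}
```

Memory is fixed by the bucket count regardless of how many rows stream through, which is exactly the property the question asks for; if `below` or `above` grows large, that's the signal to restart with a wider range.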

Keith Randall answered Nov 06 '22


There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.

The basic idea is that smaller bins are used at the extremes, in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest variant requires no dynamic allocation: once the MergingDigest is instantiated, no further heap allocation is required.

See https://github.com/tdunning/t-digest

Ted Dunning answered Nov 06 '22