Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast algorithm for repeated calculation of percentile?

People also ask

What is the formula for calculating percentiles?

Percentiles can be calculated using the formula n = (P/100) x N, where P = percentile, N = number of values in a data set (sorted from smallest to largest), and n = ordinal rank of a given value. Percentiles are frequently used to understand test scores and biometric measurements.

Can you interpolate percentiles?

To calculate an interpolated percentile, do the following: Calculate the rank to use for the percentile. Use: rank = p(n+1), where p = the percentile and n = the sample size. For our example, to find the rank for the 70th percentile, we take 0.7*(11 + 1) = 8.4.


You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity and heaps are also included in standard libraries of most programming languages.

First heap (heap A) contains smallest 75% elements, another heap (heap B) - the rest (biggest 25%). First one has biggest element on the top, second one - smallest.

  1. Adding element.

See if new element x is <= max(A). If it is, add it to heap A, otherwise - to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of elements), we need to remove biggest element from A (O(logn)) and add it to heap B (also O(logn)).
Similar if heap B became too big.

  1. Finding "0.75 median"

Just take the largest element from A (or smallest from B). Requires O(logn) or O(1) time, depending on heap implementation.

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).


A simple Order Statistics Tree is enough for this.

A balanced version of this tree supports O(logn) time insert/delete and access by Rank. So you not only get the 75% percentile, but also the 66% or 50% or whatever you need without having to change your code.

If you access the 75% percentile frequently, but only insert less frequently, you can always cache the 75% percentile element during an insert/delete operation.

Most standard implementations (like Java's TreeMap) are order statistic trees.


If you can do with an approximate answer, you can use a histogram instead of keeping entire values in memory.

For each new value, add it to the appropriate bin. Calculate percentile 75th by traversing bins and summing counts until 75% of the population size is reached. Percentile value is between bin's (which you stopped at) low bound to high bound.

This will provide O(B) complexity where B is the count of bins, which is range_size/bin_size. (use bin_size appropriate to your user case).

I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.