Fast algorithm for repeated calculation of percentile?


You can do it with two heaps. I'm not sure whether there's a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) is a max-heap containing the smallest 75% of the elements; the second heap (heap B) is a min-heap containing the rest (the biggest 25%). So the biggest element of A and the smallest element of B sit on top of their heaps.

Adding an element

Check whether the new element x is <= max(A). If it is, add it to heap A; otherwise (or if A is empty), add it to heap B.
Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
Do the same in the other direction if heap B became too big: remove its smallest element and add it to heap A.

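Finding the "0.75 median"

Just take the biggest element from A (or the smallest from B). This takes O(1) time if the heap lets you peek at its top element, as array-based binary heaps do, or O(log n) if you have to pop the element and push it back.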

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then for every n > 0, array[array.size * 3/4] = min(B), where array is the sorted array of all elements, indexed from 0, and the division is integer division.
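The two-heap scheme above can be sketched in Python with the standard heapq module. This is only an illustration under stated assumptions: the class name RunningPercentile, its attribute names, and the integer percent parameter are made up for this example, the values are assumed to be numeric (heapq only provides a min-heap, so heap A stores negated values), and percent is assumed to be below 100 so that heap B is never empty.

```python
import heapq


class RunningPercentile:
    """Keeps size(A) = floor(n * percent / 100); the answer is min(B)."""

    def __init__(self, percent=75):
        self.percent = percent  # integer percent, assumed to be in 0..99
        self.low = []   # heap A: max-heap of the smallest values (stored negated)
        self.high = []  # heap B: min-heap of the remaining, biggest values

    def add(self, x):
        # Route x by comparing it with max(A); an empty A sends x to B.
        if self.low and x <= -self.low[0]:
            heapq.heappush(self.low, -x)
        else:
            heapq.heappush(self.high, x)
        # Rebalance so that A holds exactly floor(n * percent / 100) elements.
        n = len(self.low) + len(self.high)
        target = n * self.percent // 100
        while len(self.low) > target:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        while len(self.low) < target:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def value(self):
        # min(B), i.e. sorted(values)[n * percent // 100]; O(1) peek.
        return self.high[0]
```

Each call to add moves at most one element between the heaps, so an insertion costs O(log n) and reading the percentile costs O(1). Using an integer percent keeps the target size free of floating-point rounding.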

A simple Order Statistics Tree is enough for this.

A balanced version of this tree supports O(log n) insert/delete and access by rank. So you not only get the 75th percentile, but also the 66th or the 50th or whatever you need, without having to change your code.

If you read the 75th percentile frequently but insert less often, you can cache the 75th-percentile element and update it during each insert/delete operation.

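Note that most standard library trees (like Java's TreeMap, which is a red-black tree) do not store subtree sizes, so they do not support O(log n) access by rank out of the box. You need a tree augmented with subtree sizes, either your own or one from a library (for example, the GNU C++ policy-based tree with tree_order_statistics_node_update).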
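As a rough sketch of access by rank, here is a size-augmented treap in Python. A treap is a randomized search tree, so the O(log n) bounds hold in expectation rather than in the worst case; it is used here only because it is short, not because the answer prescribes it. All names (Node, size, split, merge, insert, kth) are chosen for this example.

```python
import random


class Node:
    __slots__ = ("key", "prio", "left", "right", "size")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority keeps the tree balanced
        self.left = None
        self.right = None
        self.size = 1                # number of nodes in this subtree


def size(t):
    return t.size if t else 0


def update(t):
    t.size = 1 + size(t.left) + size(t.right)


def split(t, key):
    """Split t into (keys < key, keys >= key)."""
    if t is None:
        return None, None
    if t.key < key:
        a, b = split(t.right, key)
        t.right = a
        update(t)
        return t, b
    a, b = split(t.left, key)
    t.left = b
    update(t)
    return a, t


def merge(a, b):
    """Merge two treaps where every key in a is <= every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        update(a)
        return a
    b.left = merge(a, b.left)
    update(b)
    return b


def insert(t, key):
    """Return the new root after inserting key (duplicates allowed)."""
    a, b = split(t, key)
    return merge(merge(a, Node(key)), b)


def kth(t, k):
    """Return the k-th smallest key, counting from 0."""
    while t:
        left = size(t.left)
        if k < left:
            t = t.left
        elif k == left:
            return t.key
        else:
            k -= left + 1
            t = t.right
    raise IndexError(k)
```

Starting from root = None, call root = insert(root, x) for each new value; kth(root, size(root) * 3 // 4) then gives the 75th-percentile element, and any other rank works the same way.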

If an approximate answer is good enough, you can use a histogram instead of keeping all the values in memory.

For each new value, add it to the appropriate bin. To calculate the 75th percentile, traverse the bins, summing their counts, until 75% of the population size is reached. The percentile value lies between the low bound and the high bound of the bin you stopped at.

Adding a value is O(1), and the query has O(B) complexity, where B is the number of bins, i.e. range_size/bin_size (choose a bin_size appropriate to your use case).

I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
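A minimal Python sketch of the histogram idea, independent of the library linked above and not modeled on its API. The class name HistogramPercentile, its parameters, and the choice to clamp out-of-range values into the first or last bin are all assumptions made for this example.

```python
import math


class HistogramPercentile:
    def __init__(self, low, high, bin_size):
        self.low = low
        self.bin_size = bin_size
        # B = range_size / bin_size bins, rounded up.
        self.counts = [0] * math.ceil((high - low) / bin_size)
        self.total = 0

    def add(self, x):
        # O(1): find the bin index; out-of-range values are clamped (assumption).
        i = int((x - self.low) // self.bin_size)
        i = min(max(i, 0), len(self.counts) - 1)
        self.counts[i] += 1
        self.total += 1

    def percentile_bounds(self, percent=75):
        # O(B): sum bin counts until percent% of the population is reached.
        if self.total == 0:
            raise ValueError("no values recorded")
        threshold = self.total * percent / 100
        cumulative = 0
        for i, count in enumerate(self.counts):
            cumulative += count
            if cumulative >= threshold:
                lo = self.low + i * self.bin_size
                return lo, lo + self.bin_size
```

For example, after adding the integers 0 to 99 to HistogramPercentile(0, 100, 10), percentile_bounds(75) returns (70, 80), which brackets the exact value 75. A smaller bin_size narrows the bounds at the cost of more bins to scan.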


Tags:

algorithm

optimization

data-structures

percentile
