
How to calculate percentile in Postgres

I have a table called timings where we store 1 million response timings from load testing. Now we need to divide this data into 100 groups (i.e. the first 500 records as one group, and so on) and calculate a percentile for each group, rather than the average.

So far I have tried this query:

SELECT quartile
     , avg(data)
     , max(data)
  FROM (
        SELECT data
             , ntile(500) OVER (ORDER BY data) AS quartile
          FROM data
       ) x
 GROUP BY quartile
 ORDER BY quartile

But how do I find the percentile?

lampdev asked Jan 11 '15 04:01




2 Answers

Edit:

Please note that since I originally answered this question, Postgres has gained additional aggregate functions to help with this: percentile_disc and percentile_cont. Both were introduced in 9.4.

Original Answer:

ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).

ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.

For percentile, you would set the number of buckets to be 100.

I'm not sure where the 500 comes in here. If you want to determine which percentile each timing falls in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and each group would have about 10,000 entries rather than 500.

If you don't care about avg or max, you can drop most of your query. It would look something like this:

SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data
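
If you do want one row per percentile bucket instead of one row per timing, the same ntile(100) call can feed a GROUP BY. The following is only a sketch: generate_series stands in for the real table (an assumption made so the query runs on its own), and the aliases v, buckets, n and upper_bound are made up for illustration.

```sql
-- Sketch: one summary row per percentile bucket.
-- generate_series(1, 1000000) is stand-in data, not the asker's table.
SELECT percentile,
       count(*) AS n,            -- rows that landed in this bucket
       max(v)   AS upper_bound   -- largest value in this bucket
FROM (
    SELECT v,
           ntile(100) OVER (ORDER BY v) AS percentile
    FROM generate_series(1, 1000000) AS g(v)
) AS buckets
GROUP BY percentile
ORDER BY percentile;
```

With a million rows and 100 buckets, each bucket holds 10,000 rows, so upper_bound for bucket k is the largest of the k-th block of 10,000 sorted values.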
khampson answered Sep 20 '22 15:09


Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:

=# select v, 
          ntile(100) OVER (ORDER BY v),
          cume_dist() OVER (ORDER BY v)
   FROM (VALUES (1), (2), (4), (4)) x(v);

 v | ntile | cume_dist 
---+-------+-----------
 1 |     1 |      0.25
 2 |     2 |       0.5
 4 |     3 |         1
 4 |     4 |         1

You can see that ntile only uses the first 4 out of 100 buckets, whereas cume_dist always gives you a number greater than 0 and up to 1, however many rows there are. So if you want to find the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
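
As a rough sketch of that "throw away and take the smallest" step, reusing the same four sample values (the aliases cd, ranked and p99 are made up for illustration):

```sql
-- Sketch: 99th percentile via cume_dist on the four sample values.
SELECT min(v) AS p99
FROM (
    SELECT v,
           cume_dist() OVER (ORDER BY v) AS cd
    FROM (VALUES (1), (2), (4), (4)) x(v)
) AS ranked
WHERE cd >= 0.99;   -- keep only rows at or above the 99% mark
```

Only the two rows with v = 4 have a cume_dist of at least 0.99, so this returns 4. Changing the threshold to 0.5 would return 2.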

If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
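
A small sketch of the difference, again on the same four sample values (the column aliases are made up for illustration):

```sql
-- Sketch: median of (1, 2, 4, 4) with both ordered-set aggregates (Postgres 9.4+).
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS median_disc,  -- an actual input value
       percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS median_cont   -- interpolated value
FROM (VALUES (1), (2), (4), (4)) x(v);
```

Here percentile_disc returns 2, the first value whose cumulative position reaches 0.5, while percentile_cont returns 3, interpolated halfway between the second and third values (2 and 4).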

Paul A Jungwirth answered Sep 20 '22 15:09