I have a table called timings in which we store 1 million response timings from load testing. We now need to divide this data into 100 groups (i.e. the first 500 records as one group, and so on) and calculate a percentile for each group, rather than the average.
So far I have tried this query:
SELECT quartile
     , avg(data)
     , max(data)
FROM (
    SELECT data
         , ntile(500) over (order by data) as quartile
    FROM data
) x
GROUP BY quartile
ORDER BY quartile
But how do I find the percentile?
How to calculate the 95th percentile in PostgreSQL: you can use percentile_cont.
postgres=# SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY sale) FROM sales;
 percentile_cont
-----------------
           277.5
You can also calculate the 95th percentile in PostgreSQL using percentile_disc.
The percentile function can be used multiple times within a single query, for example to calculate all three quartiles at once. WITHIN GROUP can also be combined with other clauses such as GROUP BY, for example to calculate different percentiles for numbers less than 100 and for everything else. In such a query, GROUP BY 1 groups by the first expression in the SELECT list.
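As a rough sketch of the two ideas just described (several percentiles in one query, split by a GROUP BY 1 expression), something like the following should work on PostgreSQL 9.4 or later. The data here is hypothetical: just the integers 1 to 200 produced by generate_series, and the column aliases are made up for illustration.

```sql
-- Hypothetical input: the integers 1..200, split into "under 100" and the rest.
SELECT
    n < 100 AS under_100,                                      -- first SELECT expression
    percentile_cont(0.25) WITHIN GROUP (ORDER BY n) AS q1,     -- first quartile
    percentile_cont(0.50) WITHIN GROUP (ORDER BY n) AS median, -- second quartile
    percentile_cont(0.75) WITHIN GROUP (ORDER BY n) AS q3      -- third quartile
FROM generate_series(1, 200) AS g(n)
GROUP BY 1      -- group by the first SELECT expression (under_100)
ORDER BY 1;
```

Each output row holds the three quartiles for one of the two groups.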
Many SQL databases have percentile functions built into them, such as PERCENTILE_DISC and PERCENTILE_CONT. You can also use tdigest for these kinds of calculations. It can work on plain double values, or you can query across multiple tdigest histograms. The t-digest extension allows you to use tdigest as a special column data type.
percentile_disc returns a value from the input set closest to the percentile you request. percentile_cont returns an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but it can return a fractional value that lies between two values from the input.
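To make the difference concrete, here is a small sketch on a made-up four-row input (the values and aliases are illustrative only). For the median, percentile_disc should pick an actual input value (2 here), while percentile_cont should interpolate between the two middle values (3 here).

```sql
-- Hypothetical input: four values whose two middle values are 2 and 4.
SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS disc_median, -- a value from the input
    percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS cont_median  -- interpolated value
FROM (VALUES (1), (2), (4), (5)) AS x(v);
```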
Edit:
Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this: percentile_disc and percentile_cont. These were introduced in 9.4.
Original Answer:
ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).
ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.
For percentile, you would set the number of buckets to be 100.
I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and the groups would have more than 500 entries (10,000 each for a million rows).
If you don't care about avg or max, you can drop a bunch from your query. So it would look something like this:
SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data
Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:
=# select v,
          ntile(100) OVER (ORDER BY v),
          cume_dist() OVER (ORDER BY v)
   FROM (VALUES (1), (2), (4), (4)) x(v);
 v | ntile | cume_dist
---+-------+-----------
 1 |     1 |      0.25
 2 |     2 |       0.5
 4 |     3 |         1
 4 |     4 |         1
You can see that ntile only uses the first 4 out of 100 buckets, whereas cume_dist always gives you a number from 0 to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
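A minimal sketch of that cume_dist approach, assuming (as in the question's query) a table named data with a single numeric column also named data; the aliases cd and p99 are made up for illustration:

```sql
-- Assumed schema: table "data" with a numeric column "data", as in the question's query.
SELECT min(data) AS p99                -- smallest value among the rows that remain
FROM (
    SELECT data,
           cume_dist() OVER (ORDER BY data) AS cd  -- fraction of rows <= this value
    FROM data
) AS x
WHERE cd >= 0.99;                      -- throw away everything under 0.99
```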
If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
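If the goal really is one percentile per block of 500 consecutive records, a sketch along these lines might do it. It assumes things the question does not state: that the table has some column defining record order (called id here, purely hypothetical), and that the 95th percentile is the one wanted. The table and column name data follow the question's query.

```sql
-- Assumptions: "id" is a hypothetical ordering column; 0.95 is an example percentile.
SELECT
    grp,
    count(*) AS n,                                              -- rows in this block
    percentile_cont(0.95) WITHIN GROUP (ORDER BY data) AS p95   -- percentile within the block
FROM (
    SELECT data,
           -- integer division: rows 1..500 -> 0, rows 501..1000 -> 1, and so on
           (row_number() OVER (ORDER BY id) - 1) / 500 AS grp
    FROM data
) AS x
GROUP BY grp
ORDER BY grp;
```

Changing 500 to 10000 would give 100 blocks for a million rows, and swapping percentile_cont for percentile_disc returns an actual recorded timing instead of an interpolated one.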