How to calculate percentile in Postgres

Tags:

postgresql

postgis

I have a table called timings in which we store 1 million response timings from load testing. We now need to divide this data into 100 groups (the first 500 records as one group, and so on) and calculate a percentile for each group rather than an average.

So far I have tried this query:

SELECT quartile
     , avg(data)
     , max(data)
  FROM (
        SELECT data
             , ntile(500) OVER (ORDER BY data) AS quartile
          FROM data
       ) x
 GROUP BY quartile
 ORDER BY quartile

But how do I find the percentile?

840

asked Jan 11 '15 04:01

lampdev

2 Answers

Edit:

Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this. See percentile_disc and percentile_cont in the PostgreSQL documentation on ordered-set aggregate functions. These were introduced in 9.4.
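As a rough sketch of how these aggregates could answer the per-group part of the question: they combine with GROUP BY like any other aggregate. The example below assumes a hypothetical id column that records insertion order (the question does not show one) and uses a generated stand-in for the real table, with groups of 100 rows instead of the full-size groups.

```sql
-- Stand-in for the real table: 1000 generated rows.
-- "id" is a hypothetical column recording insertion order.
WITH sample(id, data) AS (
    SELECT g, ((g * 37) % 101)::double precision
    FROM generate_series(1, 1000) AS g
)
SELECT (id - 1) / 100 AS grp,   -- integer division: rows 1-100 -> 0, 101-200 -> 1, ...
       count(*) AS n,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY data) AS p95
FROM sample
GROUP BY grp
ORDER BY grp;
```

For the real data set you would swap the CTE for the actual table and change the divisor to the desired group size.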

Original Answer:

ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).

ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.

For percentile, you would set the number of buckets to be 100.

I'm not sure where the 500 comes in here. If you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and each group would have about 10,000 entries rather than 500.

If you don't care about avg or max, you can drop most of your query. So it would look something like this:

SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data
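If you do want one row per percentile rather than one row per timing, a sketch along the following lines groups by the ntile bucket and reads off each bucket's boundaries. The sample CTE is only a stand-in so the query runs on its own; against the real table you would select from data instead.

```sql
-- Stand-in data: the integers 1..1000.
WITH sample(data) AS (
    SELECT g FROM generate_series(1, 1000) AS g
)
SELECT percentile,
       min(data) AS lowest,    -- smallest value in this bucket
       max(data) AS highest,   -- largest value in this bucket
       count(*)  AS n          -- number of rows in this bucket
FROM (
    SELECT data, ntile(100) OVER (ORDER BY data) AS percentile
    FROM sample
) x
GROUP BY percentile
ORDER BY percentile;
```

The highest value of bucket 99 is then a reasonable reading of the 99th percentile, provided there are many more rows than buckets.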

102

answered Sep 20 '22 15:09

khampson

Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:

=# select v, 
          ntile(100) OVER (ORDER BY v),
          cume_dist() OVER (ORDER BY v)
   FROM (VALUES (1), (2), (4), (4)) x(v);

 v | ntile | cume_dist 
---+-------+-----------
 1 |     1 |      0.25
 2 |     2 |       0.5
 4 |     3 |         1
 4 |     4 |         1

You can see that ntile only uses the first 4 out of 100 buckets, whereas cume_dist always gives you a number greater than 0 and up to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
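The "throw away and take the smallest" step could be written roughly like this. The sample CTE is a generated stand-in for the real table, and the alias names (cd, p99) are just illustrative.

```sql
-- Stand-in data: the integers 1..1000.
WITH sample(data) AS (
    SELECT g FROM generate_series(1, 1000) AS g
)
SELECT min(data) AS p99
FROM (
    SELECT data, cume_dist() OVER (ORDER BY data) AS cd
    FROM sample
) x
WHERE cd >= 0.99;   -- keep only rows at or above the 99th percentile
```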

If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
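To illustrate the difference on the same four values used above, here is a small sketch computing the median both ways. percentile_cont interpolates between the two middle values, while percentile_disc returns a value that actually occurs in the input.

```sql
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS median_cont,  -- 3 (halfway between 2 and 4)
       percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS median_disc   -- 2 (first value with cume_dist >= 0.5)
FROM (VALUES (1), (2), (4), (4)) x(v);
```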

answered Sep 20 '22 15:09

Paul A Jungwirth
