Generate a histogram of values grouped by a column

Tags: postgresql, histogram

I have the following data in a review table for a certain set of items, using a score system that ranges from 0 to 100:

+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1         | 1       | 90    |
+-----------+---------+-------+
| 2         | 1       | 40    |
+-----------+---------+-------+
| 3         | 1       | 10    |
+-----------+---------+-------+
| 4         | 2       | 90    |
+-----------+---------+-------+
| 5         | 2       | 90    |
+-----------+---------+-------+
| 6         | 2       | 70    |
+-----------+---------+-------+
| 7         | 3       | 80    |
+-----------+---------+-------+
| 8         | 3       | 80    |
+-----------+---------+-------+
| 9         | 3       | 80    |
+-----------+---------+-------+
| 10        | 3       | 80    |
+-----------+---------+-------+
| 11        | 4       | 10    |
+-----------+---------+-------+
| 12        | 4       | 30    |
+-----------+---------+-------+
| 13        | 4       | 50    |
+-----------+---------+-------+
| 14        | 4       | 80    |
+-----------+---------+-------+

I am trying to create a histogram of the score values with five bins. My goal is to generate one histogram per item. To create a histogram of the entire table, it is possible to use the width_bucket function. This can also be adapted to operate on a per-item basis:

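-- Five equal-width buckets over [0, 100); a score of exactly 100 would land in bucket 6.
SELECT m.item_id, g.n AS bucket, COUNT(m.score) AS count
FROM generate_series(1, 5) AS g(n)
LEFT JOIN review AS m
       ON width_bucket(m.score, 0, 100, 5) = g.n
GROUP BY m.item_id, g.n
ORDER BY m.item_id, g.n;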

However, the result looks like this:

+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1       | 1      | 1     |
+---------+--------+-------+
| 1       | 3      | 1     |
+---------+--------+-------+
| 1       | 5      | 1     |
+---------+--------+-------+
| 2       | 4      | 1     |
+---------+--------+-------+
| 2       | 5      | 2     |
+---------+--------+-------+
| 3       | 5      | 4     |
+---------+--------+-------+
| 4       | 1      | 1     |
+---------+--------+-------+
| 4       | 2      | 1     |
+---------+--------+-------+
| 4       | 3      | 1     |
+---------+--------+-------+
| 4       | 5      | 1     |
+---------+--------+-------+

That is, bins with no entries are not included. While this is not a bad solution, I would rather have all buckets listed, with 0 for those with no entries. Even better would be this structure:

+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1       | 1        | 0        | 1        | 0        | 1        |
+---------+----------+----------+----------+----------+----------+
| 2       | 0        | 0        | 0        | 1        | 2        |
+---------+----------+----------+----------+----------+----------+
| 3       | 0        | 0        | 0        | 0        | 4        |
+---------+----------+----------+----------+----------+----------+
| 4       | 1        | 1        | 1        | 0        | 1        |
+---------+----------+----------+----------+----------+----------+

I prefer this solution as it uses a row per item (instead of 5n), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:

select item_id,
       sum(case when score >= 0  and score <= 19  then 1 else 0 end) as bucket_1,
       sum(case when score >= 20 and score <= 39  then 1 else 0 end) as bucket_2,
       sum(case when score >= 40 and score <= 59  then 1 else 0 end) as bucket_3,
       sum(case when score >= 60 and score <= 79  then 1 else 0 end) as bucket_4,
       sum(case when score >= 80 and score <= 100 then 1 else 0 end) as bucket_5
from review
group by item_id;  -- required, since item_id is selected next to the aggregates

Even though this query satisfies my requirements, I am curious to see if there might be a more elegant approach. So many case expressions are not easy to read, and a change in the bin criteria might require updating every sum. I am also curious about any performance concerns this query might have.


asked Jul 17 '18 14:07

martinarroyo

1 Answer

The second query can be rewritten to use ranges to make editing and writing the query a bit easier:

with buckets (b1, b2, b3, b4, b5) as (
  values (
     int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80),
     int4range(80, 100, '[]')  -- '[]' makes the upper bound inclusive, so a score of 100 is counted
  )
)
select item_id,
       count(*) filter (where b1 @> score) as bucket_1,
       count(*) filter (where b2 @> score) as bucket_2,
       count(*) filter (where b3 @> score) as bucket_3,
       count(*) filter (where b4 @> score) as bucket_4,
       count(*) filter (where b5 @> score) as bucket_5
from review
  cross join buckets
group by item_id
order by item_id;

A range constructed with int4range(0,20) includes the lower end and excludes the upper end.
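
Because the upper bound is excluded by default, a range written as int4range(80, 100) would not contain a score of exactly 100. The following check is a small sketch of that behaviour; the column aliases are invented for the illustration:

```sql
-- int4range(lower, upper) defaults to '[)' bounds:
-- the lower end is inclusive, the upper end is exclusive.
select int4range(0, 20)         @> 0   as lower_included,   -- true
       int4range(0, 20)         @> 20  as upper_included,   -- false
       int4range(80, 100)       @> 100 as hundred_default,  -- false
       int4range(80, 100, '[]') @> 100 as hundred_closed;   -- true
```

Passing '[]' as the third argument is one way to make the last bucket cover the full 0 to 100 scale; writing int4range(80, 101) should be equivalent for integer scores.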

The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
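
If the bins all have the same width, the bucket number could also be computed once with width_bucket and then filtered on, so that the boundaries are stated in a single place. This is only a sketch, assuming that score is an integer column and that a score of exactly 100 should fall into the last bucket. The sample rows are the ones from the question, loaded into a temporary table so the statement can be run on its own; the names numbered and bucket are made up for this example.

```sql
-- Sample data from the question, in a temporary table so the example is self-contained.
create temporary table review (review_id int, item_id int, score int);
insert into review values
  (1, 1, 90), (2, 1, 40), (3, 1, 10),
  (4, 2, 90), (5, 2, 90), (6, 2, 70),
  (7, 3, 80), (8, 3, 80), (9, 3, 80), (10, 3, 80),
  (11, 4, 10), (12, 4, 30), (13, 4, 50), (14, 4, 80);

with numbered as (
  -- width_bucket(score, 0, 100, 5) yields 1..5 for scores in [0, 100)
  -- and 6 for a score of exactly 100; least() folds that case into bucket 5.
  select item_id,
         least(width_bucket(score, 0, 100, 5), 5) as bucket
  from review
)
select item_id,
       count(*) filter (where bucket = 1) as bucket_1,
       count(*) filter (where bucket = 2) as bucket_2,
       count(*) filter (where bucket = 3) as bucket_3,
       count(*) filter (where bucket = 4) as bucket_4,
       count(*) filter (where bucket = 5) as bucket_5
from numbered
group by item_id
order by item_id;
```

With this shape, changing the bin layout means adjusting the arguments of width_bucket (and the cap passed to least) plus the list of output columns, rather than editing a pair of boundaries in every aggregate. The range-based version remains the more flexible choice when the bins are not all the same width.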


answered Nov 15 '22 10:11

a_horse_with_no_name
