Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create buckets and groups within those buckets using PostgresQL

Tags:

postgresql

How to find the distribution of credit cards by year, and completed transaction. Group these credit cards into three buckets: less than 10 transactions, between 10 and 30 transactions, more than 30 transactions?

The first method I tried to use was using the width_buckets function in PostgresQL, but the documentation says that only creates equidistant buckets, which is not what I want in this case. Because of that, I turned to case statements. However, I'm not sure how to use the case statement with a group by.

This is the data I am working with:

table 1 - credit_cards table
credit_card_id
year_opened


table 2 - transactions table
transaction_id
credit_card_id - matches credit_cards.credit_card_id
transaction_status ("complete" or "incomplete")

This is what I have gotten so far:

SELECT 

CASE WHEN transaction_count < 10 THEN “Less than 10”
WHEN transaction_count >= 10 and transaction_count < 30 THEN “10 <= transaction count < 30”
ELSE transaction_count>=30 THEN “Greater than or equal to 30”
END as buckets

count(*) as ct.transaction_count
FROM credit_cards c
INNER JOIN transactions t
ON c.credit_card_id = t.credit_card_id
WHERE t.status = “completed”
GROUP BY v.year_opened

GROUP BY buckets
ORDER BY buckets

Expected output

credit card count | year opened | transaction count bucket
23421             | 2002        | Less than 10
etc
like image 525
pr338 Avatar asked Jul 03 '18 17:07

pr338


People also ask

What is bucket in PostgreSQL?

The PostgreSQL NTILE() function allows you to divide ordered rows in the partition into a specified number of ranked groups as equal size as possible. These ranked groups are called buckets.

How does group by work in PostgreSQL?

The PostgreSQL GROUP BY clause is used in collaboration with the SELECT statement to group together those rows in a table that have identical data. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups.

Can you store lists in PostgreSQL?

PostgreSQL gives you this capability with the array datatype. Arrays are some of the most useful data types for storing lists of information.


2 Answers

You can specify the bin sizes in width_bucket by specifying a sorted array of the lower bound of each bin.

In you case, it would be array[10,30]: anything less than 10 gets bin 0, between 10 and 29 gets bin 1 and 30 or more gets bin 2.

WITH a AS (select generate_series(5,35) cnt)
SELECT  cnt, width_bucket(cnt, array[10,30]) 
FROM a;
like image 136
JGH Avatar answered Sep 19 '22 13:09

JGH


To figure this out you need to count transactions per credit card in order to figure out the right bucket, then you need to count the credit cards per bucket per year. There are a couple of different ways to get the final result. One way is to first join up all your data and compute the first level of aggregate values. Then compute the final level of aggregate values:

with t1 as (
  select year_opened
     , c.credit_card_id
     , case when count(*) < 10 then 'Less than 10'
            when count(*) < 30 then 'Between [10 and 30)'
            else 'Greater than or equal to 30'
       end buckets
  from credit_cards c
  join transactions t
    on t.credit_card_id = c.credit_card_id
 where t.transaction_status = 'complete'
 group by year_opened
     , c.credit_card_id
)
select count(*) credit_card_count
     , year_opened
     , buckets
  from t1
 group by year_opened
     , buckets;

However, it may be more perforamant first calculate the first level of aggregate data on the transactions table before joining it to the credit cards table:

select count(*) credit_card_count
     , year_opened
     , buckets
  from credit_cards c
  join (select credit_card_id
             , case when count(*) < 10 then 'Less than 10'
                    when count(*) < 30 then 'Between [10 and 30)'
                    else 'Greater than or equal to 30'
               end buckets
          from transactions
         group by credit_card_id) t
    on t.credit_card_id = c.credit_card_id
 group by year_opened
     , buckets;

If you prefer to unroll the above query and uses Common Table Expressions, you can do that too (I find this easier to read/follow along):

with bkt as (
  select credit_card_id
       , case when count(*) < 10 then 'Less than 10'
              when count(*) < 30 then 'Between [10 and 30)'
              else 'Greater than or equal to 30'
          end buckets
    from transactions
   group by credit_card_id
)
select count(*) credit_card_count
     , year_opened
     , buckets
  from credit_cards c
  join bkt t
    on t.credit_card_id = c.credit_card_id
 group by year_opened
     , buckets;
like image 32
Sentinel Avatar answered Sep 19 '22 13:09

Sentinel