Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient GROUP BY a CASE expression in Amazon Redshift/PostgreSQL

In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the resulting table. One way to do this is to GROUP BY a CASE expression where unimportant groups are coalesced into a single row via the CASE expression returning a single value, e.g., NULL for the groups. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel: close to PosgreSQL 8.0 in terms of functionality.

As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to perform aggregation such that one row is emitted for every (type, url) pair where the URL visit count exceeds a certain threshold and one (type, NULL) row is emitted for all (type, url) pairs where the visit count is under that threshold. The rest of the columns in the result table would have SUM/COUNT aggregates based on this grouping.

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
|  A   | http://popular.com   |                       |
|  A   | http://popular.com   |                       |
|  A   | < 9997 more times>   |                       |
|  A   | http://popular.com   |                       |
|  A   | http://small-one.com |                       |
|  B   | http://tiny.com      |                       |
|  B   | http://tiny-too.com  |                       |

should produce the following result table with a threshold of 10,000

+------+------------------------------------+--------------------------+
| type | url                  | visit_count | < SUM/COUNT aggregates > |
+------+------------------------------------+--------------------------+
|  A   | http://popular.com   |       10000 |                          |
|  A   |                      |           1 |                          |
|  B   |                      |           2 |                          |

Summary:

Amazon Redshift has certain subquery correlation limitations one needs to tip-toe around. Gordon Linoff's answer below (the accepted answer) shows how to perform a GROUP BY a CASE expression using double aggregation and replicating the expression in both the result column and the outer GROUP BY clause.

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select type, (case when cnt >= 10000 then url end) as url, sum(cnt) as cnt
from temp_counts
group by type, (case when cnt >= 10000 then url end)

Further testing indicated that the double aggregation can be "unrolled" into a UNION ALL of independent queries involving each independent CASE expression. In this particular case on a sample data set with approximately 200M rows, this approach consistently performed about 30% faster. That result is schema and data-specific, however.

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select * from temp_counts WHERE cnt >= 10000
UNION ALL
SELECT type, NULL as url, SUM(cnt) as cnt from temp_counts 
WHERE cnt < 10000 
GROUP BY type

This suggests two general patterns for implementing and optimizing arbitrary disjoined grouping & summarization in Amazon Redshift. If performance is important for you, benchmark both.

like image 343
Sim Avatar asked Oct 03 '22 00:10

Sim


2 Answers

You would do this with two aggregations:

select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url
     ) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;
like image 138
Gordon Linoff Avatar answered Oct 10 '22 12:10

Gordon Linoff


  1. First, you group on type, url.
  2. Then you group a 2nd time on type, case when visit_count < 10000 then NULL else url.

I've used SQL Server syntax, I hope that works for Postgres, too.

like image 31
usr Avatar answered Oct 10 '22 12:10

usr