Is there a way to accomplish the following in BigQuery? This syntax is supported in a DB such as Postgres:
SELECT ProductGroup, Product, Year, Month, AVG(Revenue)
FROM Sales
group by rollup(ProductGroup, Product), rollup(Year, Month)
In other words, I want the cross product of the two rollups:
ROLLUP(ProductGroup, Product) --> (), (ProductGroup), (ProductGroup, Product)
ROLLUP(Year, Month) --> (), (Year), (Year, Month)
((), (ProductGroup), (ProductGroup, Product)) x ((), (Year), (Year, Month))
= (
(), (ProductGroup), (ProductGroup, Product),
(Year), (Year, ProductGroup), (Year, ProductGroup, Product).
(Year, Month), (Year, Month, ProductGroup), (Year, Month, ProductGroup, Product)
)
When attempting in BQ I get the following error:
The GROUP BY clause only supports ROLLUP when there are no other grouping elements at [2:10]
Here is an update with some sample pictures and data.
First, I want to replicate the functionality of an Excel Pivot table. This is where the cross product of the rollups from ROWS and COLS come into play:
Note that the Pivot table has 63 value cells.
Now, the correct SQL syntax for this is as follows in the verbose GROUP BY
-only syntax:
Notice that this also produces exactly 63 rows (and since we have one value column -- SUM Revenue -- 63 rows x 1 col = 63 value cells). The query is the following:
with sales as (
select 2010 Year, 'Jan' Month, 'Electronics' ProductGroup, 'Phone' Product, 100 Revenue union all
select 2010, 'Jan', 'Electronics', 'Laptop', 200 union all
select 2010, 'Jan', 'Cars', 'Jeep', 250 union all
select 2010, 'Jan', 'Cars', 'Hummer', 105 union all
select 2010, 'Feb', 'Electronics', 'Phone', 110 union all
select 2010, 'Feb', 'Electronics', 'Laptop', 300 union all
select 2010, 'Feb', 'Cars', 'Jeep', 50 union all
select 2010, 'Feb', 'Cars', 'Hummer', 75 union all
select 2010, 'Mar', 'Electronics', 'Phone', 80 union all
select 2010, 'Mar', 'Electronics', 'Laptop', 200 union all
select 2010, 'Mar', 'Cars', 'Jeep', 100 union all
select 2010, 'Mar', 'Cars', 'Hummer', 50 union all
select 2011, 'Jan', 'Electronics', 'Phone', 200 union all
select 2011, 'Jan', 'Electronics', 'Laptop', 300 union all
select 2011, 'Jan', 'Cars', 'Jeep', 100 union all
select 2011, 'Jan', 'Cars', 'Hummer', 200 union all
select 2011, 'Feb', 'Electronics', 'Phone', 300 union all
select 2011, 'Feb', 'Electronics', 'Laptop', 900 union all
select 2011, 'Feb', 'Cars', 'Jeep', 100 union all
select 2011, 'Feb', 'Cars', 'Hummer', 200 union all
select 2011, 'Mar', 'Electronics', 'Phone', 400 union all
select 2011, 'Mar', 'Electronics', 'Laptop', 350 union all
select 2011, 'Mar', 'Cars', 'Jeep', 240 union all
select 2011, 'Mar', 'Cars', 'Hummer', 130
)
-- ROLLUP(ProductGroup, Product), ROLLUP(Year, Month)
--> (), (ProductGroup), (ProductGroup, Product)
--> (Year), (Year, ProductGroup), (Year, ProductGroup, Product)
--> (Year, Month), (Year, Month, ProductGroup), (Year, Month, ProductGroup, Product)
SELECT NULL, NULL, NULL, NULL, AVG(Revenue) FROM Sales UNION ALL -- ()
SELECT ProductGroup, NULL, NULL, NULL, AVG(Revenue) FROM Sales GROUP BY ProductGroup UNION ALL -- (ProductGroup)
SELECT ProductGroup, Product, NULL, NULL, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Product UNION ALL -- (ProductGroup, Product)
SELECT NULL, NULL, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year UNION ALL -- (Year)
SELECT ProductGroup, NULL, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year, ProductGroup UNION ALL -- (Year, ProductGroup)
SELECT ProductGroup, Product, Year, NULL, AVG(Revenue) FROM Sales GROUP BY Year, ProductGroup, Product UNION ALL-- (Year, ProductGroup, Product)
SELECT NULL, NULL, Year, Month, AVG(Revenue) FROM Sales GROUP BY Year, Month UNION ALL -- (Year, Month)
SELECT ProductGroup, NULL, Year, Month, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Year, Month UNION ALL -- (ProductGroup, Year, Month)
SELECT ProductGroup, Product, Year, Month, AVG(Revenue) FROM Sales GROUP BY ProductGroup, Product, Year, Month -- (ProductGroup, Product, Year Month)
However, this query is really a nightmare to product -- even if generated programatically -- as there may be an order by
, subselect
, ... etc and union-ing all those statements together could potentially turn into a monstrous construction (for example, a 3 rows x 3 cols construction with a 100-line SQL statement would become 4^2 * 100 lines of sql, and 5x5 would be 5^2 * 100 lines, etc. if my math is correct).
What would be the proper way to do this then? Note that in a database like Postgres the following works as-is:
SELECT ProductGroup, Product, Year, Month, AVG(Revenue) FROM Sales GROUP BY ROLLUP(ProductGroup, Product), ROLLUP(Year, Month);
Here is the Saved Query if you want to use this as a starting point: https://console.cloud.google.com/bigquery?sq=260144861653:552549d2a81a47b59df6e3d16ef9bf17.
Finally, if you think it would be a useful feature to add in the GROUPING SETS
and CUBE
, please upvote this feature request: https://issuetracker.google.com/issues/204913323.
Unlike the CUBE subclause, ROLLUP does not create all possible grouping sets based on the dimension columns; the CUBE makes a subset of those. When generating the grouping sets, ROLLUP assumes a hierarchy among the dimension columns and only generates grouping sets based on this hierarchy.
Introduction to SQL ROLLUP The ROLLUP is an extension of the GROUP BY clause. The ROLLUP option allows you to include extra rows that represent the subtotals, which are commonly referred to as super-aggregate rows, along with the grand total row.
ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY clause. ROLLUP creates subtotals at any level of aggregation needed, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP , enabling a single statement to calculate all possible combinations of subtotals.
It allows us to generate multiple grouping sets within a single query, which is impossible with the GROUP BY clause as it aggregates a single group. Thus, we can say that the ROLLUP provides a more detailed analysis by employing a single query to create several grouping sets along the hierarchy of columns.
Back to my original answer of grouping twice...
WITH
rollup_pg_p AS
(
SELECT
ProductGroup, Product, 1 AS dummy, Year, Month, SUM(Revenue) AS sum_rev, COUNT(Revenue) AS cnt_row
FROM
Sales
GROUP BY
ROLLUP(Year, Month, ProductGroup, Product)
HAVING
Month IS NOT NULL -- This prevents the roll up going further than desired
-- Effectively giving `GROUP BY Year, Month, ROLLUP(ProductGroup, Product)
)
SELECT
ProductGroup, Product, Year, Month, SUM(sum_rev) / SUM(cnt_row)
FROM
rollup_pg_p
GROUP BY
ROLLUP(ProductGroup, Product, Dummy, Year, Month)
HAVING
dummy IS NOT NULL -- Same 'trick' again, but we created the dummy column
-- as ProductGroup and Product CAN legitimately be NULL at this point.
ORDER BY
1, 2, 3, 4
(NOTE: Other dialects would use WHERE NOT GROUPING(Product)
, so avoid the need for the dummy column, but BigQuery doesn't appear to have that functionality either...)
Still has the downside of not working for some aggregates, but might be meaningfully faster than the alternative approaches.
ugly but probably the simplest way to have 3 group by statement and union them :
SELECT ProductGroup,Product,NULL year ,NULL month, AVG(sales.Revenue) avg
FROM sales
GROUP BY ROLLUP(ProductGroup,Product)
UNION DISTINCT
SELECT ProductGroup,Product,Year,NULL month, AVG(sales.Revenue) avg
FROM sales
GROUP BY ROLLUP(Year, ProductGroup, Product)
UNION DISTINCT
SELECT ProductGroup,Product,Year,MONTH, AVG(sales.Revenue) avg
FROM sales
GROUP BY ROLLUP(Year, Month, ProductGroup, Product)
GCP fiddle
it gives the correct number of rows but how would I pass along the AVG(Revenue)?
Consider below - looks like a simple pattern to apply to more of your potential cases
select r.Year, r.Month, r.ProductGroup, r.Product,
round(avg(Revenue), 2) avg_Revenue
from (
select * from (
select ProductGroup, Product from Sales group by rollup(ProductGroup, Product)
), (
select Year, Month from Sales group by rollup(Year, Month)
)
) r
join sales s
on if(r.Year is null, true, r.Year = s.Year)
and if(r.Month is null, true, r.Month = s.Month)
and if(r.ProductGroup is null, true, r.ProductGroup = s.ProductGroup)
and if(r.Product is null, true, r.Product = s.Product)
group by r.Year, r.Month, r.ProductGroup, r.Product
if applied to sample data in your script - output is (cut by top rows)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With