I'm using the the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:
select
coalesce(a, "All A") as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by rollup (
colA,
colB
)
order by
colA asc,
colB asc
I'd then expect an output like:
+-------+-------+------+
| colA | colB | sumC |
+-------+-------+------+
| All A | All B | 300 |
| a1 | All B | 100 |
| a1 | b1 | 30 |
| a1 | b2 | 70 |
| a2 | All B | 200 |
| a2 | b1 | 50 |
| a2 | b2 | 150 |
+-------+-------+------+
However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:
select
a as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by
a,
rollup (b)
order by
colA asc,
colB asc
And I'd expect an output like:
+-------+-------+------+
| colA | colB | sumC |
+-------+-------+------+
| a1 | All B | 100 |
| a1 | b1 | 30 |
| a1 | b2 | 70 |
| a2 | All B | 200 |
| a2 | b1 | 50 |
| a2 | b2 | 150 |
+-------+-------+------+
I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.
I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.
Thanks in advance, and please let me know if I can clarify anything.
Try this GROUPING SETS option:
%sql
SELECT
COALESCE( a, 'all a' ) a,
COALESCE( b, 'all b' ) b,
SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b
My results (with updated numbers):

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With