Question is pretty much in the title. I can't find any detailed documentation regarding the differences.
I do notice a difference because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using 'cube', I got a lot of null values on the expressions I often grouped by.
ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY clause. ROLLUP creates subtotals at any level of aggregation needed, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP , enabling a single statement to calculate all possible combinations of subtotals.
CUBE generates a result set that shows aggregates for all combinations of values in the selected columns. ROLLUP generates a result set that shows aggregates for a hierarchy of values in the selected columns.
Unlike the CUBE subclause, ROLLUP does not create all possible grouping sets based on the dimension columns; the CUBE makes a subset of those. When generating the grouping sets, ROLLUP assumes a hierarchy among the dimension columns and only generates grouping sets based on this hierarchy.
CUBE operator is used in the GROUP BY clause of a SELECT statement to return a result set of multidimensional (multiple columns) nature. Example: A table product has the following records:- Apparel. Brand.
These are not intended to work in the same way. groupBy
is simply an equivalent of the GROUP BY
clause in standard SQL. In other words
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube
is equivalent to CUBE
extension to GROUP BY
. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show // +---+---+ // | x| y| // +---+---+ // |foo| 1| // |foo| 2| // |bar| 2| // |bar| 2| // +---+---+
and you compute cube(x, y)
with count as an aggregation:
df.cube($"x", $"y").count.show // +----+----+-----+ // | x| y|count| // +----+----+-----+ // |null| 1| 1| <- count of records where y = 1 // |null| 2| 3| <- count of records where y = 2 // | foo|null| 2| <- count of records where x = foo // | bar| 2| 2| <- count of records where x = bar AND y = 2 // | foo| 1| 1| <- count of records where x = foo AND y = 1 // | foo| 2| 1| <- count of records where x = foo AND y = 2 // |null|null| 4| <- total count of records // | bar|null| 2| <- count of records where x = bar // +----+----+-----+
A similar function to cube
is rollup
which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show // +----+----+-----+ // | x| y|count| // +----+----+-----+ // | foo|null| 2| <- count where x is fixed to foo // | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2 // | foo| 1| 1| ... // | foo| 2| 1| ... // |null|null| 4| <- count where no column is fixed // | bar|null| 2| <- count where x is fixed to bar // +----+----+-----+
Just for comparison lets see the result of plain groupBy
:
df.groupBy($"x", $"y").count.show // +---+---+-----+ // | x| y|count| // +---+---+-----+ // |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP // |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP // |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP // +---+---+-----+
To summarize:
GROUP BY
every row is included only once in its corresponding summary.With GROUP BY CUBE(..)
every row is included in summary of each combination of levels it represents, wildcards included. Logically, the shown above is equivalent to something like this (assuming we could use NULL
placeholders):
SELECT NULL, NULL, COUNT(*) FROM table UNION ALL SELECT x, NULL, COUNT(*) FROM table GROUP BY x UNION ALL SELECT NULL, y, COUNT(*) FROM table GROUP BY y UNION ALL SELECT x, y, COUNT(*) FROM table GROUP BY x, y
With GROUP BY ROLLUP(...)
is similar to CUBE
but works hierarchically by filling colums from left to right.
SELECT NULL, NULL, COUNT(*) FROM table UNION ALL SELECT x, NULL, COUNT(*) FROM table GROUP BY x UNION ALL SELECT x, y, COUNT(*) FROM table GROUP BY x, y
ROLLUP
and CUBE
come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With