What is the difference between cube, rollup and groupBy operators?

The question is pretty much in the title. I can't find any detailed documentation regarding the differences.

I do notice a difference: when I swap the cube and groupBy function calls, I get different results. With cube, the result contains a lot of null values in the expressions I grouped by.

793

asked Jun 22 '16 18:06

Eric Staner

1 Answer

These are not intended to work in the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words,

table.groupBy($"foo", $"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies the aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

df.show
// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

and you compute cube(x, y) with count as an aggregation:

df.cube($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

A similar function to cube is rollup, which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to 2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

Just for comparison, let's see the result of plain groupBy:

df.groupBy($"x", $"y").count.show
// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

To summarize:

When using plain GROUP BY, every row is included only once, in its corresponding summary.

With GROUP BY CUBE(...), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the cube shown above is equivalent to something like this (assuming we could use NULL placeholders):

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y

GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling columns from left to right:

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
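The same idea can be sketched in plain Scala without Spark, as a rough illustration of the UNION ALL view rather than how Spark actually executes these operators. The object name GroupingSetsSketch and the helpers groupBySets, rollupSets, cubeSets and counts are made up for this example, and None stands in for the NULL placeholder:

```scala
// Plain-Scala sketch (no Spark); None plays the role of the NULL placeholder.
object GroupingSetsSketch {
  type Row = (String, Long)
  type Key = (Option[String], Option[Long])

  // Same sample data as the DataFrame in the answer.
  val rows: Seq[Row] = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L))

  // A grouping set says which columns stay fixed: (keepX, keepY).
  val groupBySets: Seq[(Boolean, Boolean)] = Seq((true, true))
  // rollup(x, y): prefixes of the column list -> (x, y), (x), ()
  val rollupSets: Seq[(Boolean, Boolean)] =
    Seq((true, true), (true, false), (false, false))
  // cube(x, y): every subset of the column list -> rollup sets plus (y)
  val cubeSets: Seq[(Boolean, Boolean)] = rollupSets :+ ((false, true))

  // Count rows once per grouping set, blanking out the columns that are not kept.
  def counts(sets: Seq[(Boolean, Boolean)]): Map[Key, Int] =
    sets.flatMap { case (keepX, keepY) =>
      rows
        .groupBy { case (x, y) =>
          (if (keepX) Some(x) else None, if (keepY) Some(y) else None)
        }
        .map { case (key, group) => key -> group.size }
    }.toMap

  def main(args: Array[String]): Unit = {
    println(counts(groupBySets).size) // 3 result rows
    println(counts(rollupSets).size)  // 6 result rows
    println(counts(cubeSets).size)    // 8 result rows
  }
}
```

The row counts line up with the three show outputs above: each input row lands in exactly one group per grouping set, so groupBy uses one set, rollup three and cube four. The sketch merges results into a Map keyed by the blanked-out columns, which only works because the sample data has no real nulls in x or y.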

ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work, you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.

181

answered Sep 28 '22 02:09

zero323

Tags: sql, apache-spark, apache-spark-sql, rollup, cube
