Is there no function-level grouping_sets support in Spark Scala?
I don't know whether this patch was ever applied to master: https://github.com/apache/spark/pull/5080
I want to do this kind of query with the Scala DataFrame API:
GROUP BY expression_list GROUPING SETS (expression_list_2)
The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?
GROUPING SETS are groups, or sets, of columns by which rows can be grouped together. Instead of writing multiple queries and combining the results with a UNION, you can simply use GROUPING SETS. GROUPING SETS in SQL can be considered an extension of the GROUP BY clause.
GROUP BY GROUPING SETS ( ... ): the GROUPING SETS option gives you the ability to combine multiple GROUP BY clauses into one GROUP BY clause. The results are the equivalent of a UNION ALL of the specified groups.
GROUPING is used to distinguish the null values that are returned by ROLLUP, CUBE, or GROUPING SETS from standard null values. The NULL returned as the result of a ROLLUP, CUBE, or GROUPING SETS operation is a special use of NULL: it acts as a column placeholder in the result set and means "all values".
GROUPING SETS specifies multiple groupings of data in one query. Only the specified groups are aggregated, instead of the full set of aggregations that are generated by CUBE or ROLLUP. GROUPING SETS can contain a single element or a list of elements.
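As a rough illustration of that UNION ALL equivalence, here is a sketch in Spark SQL (the sales table and its city/year/amount columns are only assumed example names, matching the example used in the other answer):

// Sketch only: assumes a "sales" table with city, year and amount columns.
// Both queries below return the same rows.

// Using GROUPING SETS:
spark.sql("""
  SELECT city, year, sum(amount) AS amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city), (year))
""")

// The equivalent UNION ALL of two separate GROUP BY queries:
spark.sql("""
  SELECT city, NULL AS year, sum(amount) AS amount FROM sales GROUP BY city
  UNION ALL
  SELECT NULL AS city, year, sum(amount) AS amount FROM sales GROUP BY year
""")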
I want to do this kind of query with the Scala DataFrame API.
tl;dr: As of Spark 2.1.0 it is not possible. There are currently no plans to add such an operator to the Dataset API.
Spark SQL supports the following so-called multi-dimensional aggregate operators:
- rollup operator
- cube operator
- GROUPING SETS clause (only in SQL mode)
- grouping() and grouping_id() functions

NOTE: GROUPING SETS is only available in SQL mode. There is no support in the Dataset API.
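The examples below assume spark-shell (note the scala> prompt), where spark.implicits._ and the sql function are already in scope; in a standalone application you would import them from your SparkSession.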
val sales = Seq(
("Warsaw", 2016, 100),
("Warsaw", 2017, 200),
("Boston", 2015, 50),
("Boston", 2016, 150),
("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")
// equivalent to rollup("city", "year")
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550| <-- grand total across all cities and years
+-------+----+------+
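For completeness, a sketch (not part of the original answer) of the closest Dataset API counterpart of the query above, using the rollup operator and nulls-last ordering on Column:

// sketch: rollup("city", "year") covers the grouping sets (city, year), (city) and ()
import org.apache.spark.sql.functions._

val rollupQ = sales
  .rollup("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
rollupQ.show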
// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), (year), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|2015| 50| <-- total across all cities in 2015
| null|2016| 250| <-- total across all cities in 2016
| null|2017| 250| <-- total across all cities in 2017
| null|null| 550|
+-------+----+------+
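Similarly, a sketch of the cube-based Dataset API counterpart (again an illustration, not part of the original answer):

// sketch: cube("city", "year") covers (city, year), (city), (year) and ()
val cubeQ = sales
  .cube("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
cubeQ.show

An arbitrary set of grouping sets that matches neither rollup nor cube can only be emulated in the Dataset API by unioning separate groupBy results, as illustrated earlier.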
If a value in a column of the resulting table is null, it does not necessarily mean that the column was aggregated on that row. If that column has nulls in the original table, a null in the aggregated result may simply be a null value carried over from the original table. Use the grouping function to check whether the column was aggregated on a specific row or not.
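A sketch of how the grouping function can be used for that check (the flag column aliases here are made up for illustration):

// grouping(col) returns 1 when the row aggregates over that column, 0 otherwise
val withFlags = sales
  .cube("city", "year")
  .agg(
    sum("amount") as "amount",
    grouping("city") as "city_aggregated",
    grouping("year") as "year_aggregated")
withFlags.show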