Spark SQL: apply aggregate functions to a list of columns

Tags: dataframe, aggregate-functions, apache-spark, apache-spark-sql

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column:

df.groupBy("col1")
  .agg(sum("col2").alias("col2"), sum("col3").alias("col3"), ...)

asked Nov 23 '15 23:11

lilloraffa

3 Answers

There are multiple ways of applying aggregate functions to multiple columns.

The GroupedData class provides methods for the most common functions, including count, max, min, mean and sum, which can be used directly as follows:

Python:

df = sqlContext.createDataFrame(
    [(1.0, 0.3, 1.0), (1.0, 0.5, 0.0), (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2)],
    ("col1", "col2", "col3"))

df.groupBy("col1").sum().show()

## +----+---------+-----------------+---------+
## |col1|sum(col1)|        sum(col2)|sum(col3)|
## +----+---------+-----------------+---------+
## | 1.0|      2.0|              0.8|      1.0|
## |-1.0|     -2.0|6.199999999999999|      0.7|
## +----+---------+-----------------+---------+

Scala

val df = sc.parallelize(Seq(
  (1.0, 0.3, 1.0), (1.0, 0.5, 0.0),
  (-1.0, 0.6, 0.5), (-1.0, 5.6, 0.2))
).toDF("col1", "col2", "col3")

df.groupBy($"col1").min().show

// +----+---------+---------+---------+
// |col1|min(col1)|min(col2)|min(col3)|
// +----+---------+---------+---------+
// | 1.0|      1.0|      0.3|      0.0|
// |-1.0|     -1.0|      0.6|      0.2|
// +----+---------+---------+---------+

Optionally, you can pass a list of columns to aggregate:

df.groupBy("col1").sum("col2", "col3")

You can also pass a dictionary / map with columns as the keys and functions as the values:

Python

exprs = {x: "sum" for x in df.columns}
df.groupBy("col1").agg(exprs).show()

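## +----+---------+-----------------+---------+
## |col1|sum(col1)|        sum(col2)|sum(col3)|
## +----+---------+-----------------+---------+
## | 1.0|      2.0|              0.8|      1.0|
## |-1.0|     -2.0|6.199999999999999|      0.7|
## +----+---------+-----------------+---------+
## (the order of the aggregated columns may vary)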

Scala

val exprs = df.columns.map((_ -> "mean")).toMap
df.groupBy($"col1").agg(exprs).show()

// +----+---------+------------------+---------+
// |col1|avg(col1)|         avg(col2)|avg(col3)|
// +----+---------+------------------+---------+
// | 1.0|      1.0|               0.4|      0.5|
// |-1.0|     -1.0|3.0999999999999996|     0.35|
// +----+---------+------------------+---------+

Finally you can use varargs:

Python

# Import under a namespace so that Spark's min does not shadow Python's built-in min
from pyspark.sql import functions as F

exprs = [F.min(x) for x in df.columns]
df.groupBy("col1").agg(*exprs).show()

Scala

import org.apache.spark.sql.functions.sum

val exprs = df.columns.map(sum(_))
df.groupBy($"col1").agg(exprs.head, exprs.tail: _*)

There are other ways to achieve a similar effect, but these should be more than enough most of the time.
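Note that the examples above also aggregate the grouping column itself (hence sum(col1) and min(col1) in the output), and none of them keeps the original column names. A minimal Python sketch of the varargs form that skips the grouping keys and aliases each result back to its source column might look like this. The helper names columns_to_aggregate and sum_keeping_names are made up for illustration and are not part of the Spark API.

```python
def columns_to_aggregate(all_columns, group_cols):
    """Return every column that is not a grouping key, in original order."""
    keys = set(group_cols)
    return [c for c in all_columns if c not in keys]


def sum_keeping_names(df, group_cols):
    """Sum all non-key columns of df, aliasing each result to its source name."""
    # Imported lazily so the helper above stays usable without a Spark install.
    from pyspark.sql import functions as F

    targets = columns_to_aggregate(df.columns, group_cols)
    exprs = [F.sum(c).alias(c) for c in targets]
    return df.groupBy(*group_cols).agg(*exprs)
```

With the df defined earlier, sum_keeping_names(df, ["col1"]) should return a dataframe whose columns are col1, col2 and col3.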

zero323

Here is another example of the same concept, for the case where you have two different columns and want to apply a different aggregate function to each of them, i.e.

df.groupBy("col1").agg(sum("col2").alias("col2"), avg("col3").alias("col3"), ...)

The example below achieves this with a Map, though I do not yet know how to add the aliases in this case:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema: three string columns and two integer columns
val Claim1 = StructType(Seq(
  StructField("pid", StringType, true), StructField("diag1", StringType, true),
  StructField("diag2", StringType, true),
  StructField("allowed", IntegerType, true), StructField("allowed1", IntegerType, true)))
val claimsData1 = Seq(
  ("PID1", "diag1", "diag2", 100, 200), ("PID1", "diag2", "diag3", 300, 600),
  ("PID1", "diag1", "diag5", 340, 680), ("PID2", "diag3", "diag4", 245, 490),
  ("PID2", "diag2", "diag1", 124, 248))

val claimRDD1 = sc.parallelize(claimsData1)
val claimRDDRow1 = claimRDD1.map(p => Row(p._1, p._2, p._3, p._4, p._5))
val claimRDD2DF1 = sqlContext.createDataFrame(claimRDDRow1, Claim1)

// Same function for every listed column
val l = List("allowed", "allowed1")
val sumExprs = l.map(_ -> "sum").toMap
claimRDD2DF1.groupBy("pid").agg(sumExprs).show(false)

// A different function per column
val mixedExprs = Map("allowed" -> "sum", "allowed1" -> "avg")
claimRDD2DF1.groupBy("pid").agg(mixedExprs).show(false)
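One possible way to get aliases while still driving the aggregation from a column-to-function mapping is to turn each entry into a SQL expression string with an AS clause and parse it with pyspark.sql.functions.expr. The following Python sketch assumes plain column names (no spaces or special characters); aliased_agg_exprs and agg_with_aliases are hypothetical helper names, not Spark functions.

```python
def aliased_agg_exprs(funcs_by_column):
    """Turn {"allowed": "sum"} into ["sum(allowed) AS allowed"]."""
    return [
        "{f}({c}) AS {c}".format(f=func, c=column)
        for column, func in funcs_by_column.items()
    ]


def agg_with_aliases(df, group_col, funcs_by_column):
    """Apply one aggregate function per column and keep the original names."""
    # Imported lazily so the string helper above works without Spark installed.
    from pyspark.sql import functions as F

    exprs = [F.expr(e) for e in aliased_agg_exprs(funcs_by_column)]
    return df.groupBy(group_col).agg(*exprs)


print(aliased_agg_exprs({"allowed": "sum", "allowed1": "avg"}))
# ['sum(allowed) AS allowed', 'avg(allowed1) AS allowed1']
```

Given a PySpark dataframe with the same columns as the claims data above, agg_with_aliases(df, "pid", {"allowed": "sum", "allowed1": "avg"}) should produce the columns pid, allowed and allowed1.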

answered Oct 16 '22 11:10

Sumit Pal

Current answers are perfectly correct on how to create the aggregations, but none actually address the column alias/renaming that is also requested in the question.

Typically, this is how I handle this case:

import org.apache.spark.sql.functions.col

val dimensions = List("col1")
val metrics = List("col2", "col3", "col4")
val columnOfInterests = dimensions ++ metrics

val df = spark.read.table("some_table")
    .select(columnOfInterests.map(c => col(c)): _*)
    .groupBy(dimensions.map(d => col(d)): _*)
    .agg(metrics.map(m => m -> "sum").toMap)
    .toDF(columnOfInterests: _*)    // that's the interesting part

The last line renames every column of the aggregated dataframe back to the original field names, changing sum(col2) and sum(col3) to simply col2 and col3. Because toDF renames by position, this relies on the aggregated columns coming out in the same order as metrics.
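If relying on column position is a concern, a variation is to rename by name instead. The Python sketch below assumes the generated column names follow the func(column) pattern seen in the outputs earlier on this page (for example sum(col2)); the helper names generated_to_original and rename_aggregates are invented for illustration.

```python
def generated_to_original(metrics, func_name):
    """Map generated names such as "sum(col2)" back to "col2"."""
    return {"{}({})".format(func_name, m): m for m in metrics}


def rename_aggregates(df, metrics, func_name):
    """Rename aggregated columns by name rather than by position."""
    mapping = generated_to_original(metrics, func_name)
    for generated, original in mapping.items():
        # withColumnRenamed leaves df unchanged if the column is absent
        df = df.withColumnRenamed(generated, original)
    return df


print(generated_to_original(["col2", "col3"], "sum"))
# {'sum(col2)': 'col2', 'sum(col3)': 'col3'}
```

Here func_name has to match the name as Spark prints it: as the Scala output above shows, asking for "mean" yields columns named avg(...), so "avg" would be the value to pass in that case.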

answered Oct 16 '22 13:10

Philippe Oger
