I have a dataframe with multiple columns. I want to group by one of the columns and aggregate all the other columns at once. Say the table has 4 columns, cust_id, f1, f2, f3, and I want to group by cust_id and then get avg(f1), avg(f2), and avg(f3). The table will have many columns. Any hints?
The following code is a good start, but since I have many columns it may not be a good idea to write them all out manually:
df.groupBy("cust_id").agg(sum("f1"), sum("f2"), sum("f3"))
Maybe you can try mapping over the list of column names:
import org.apache.spark.sql.functions.avg

val groupCol = "cust_id"
// Build an avg(...) expression for every column except the grouping column
val aggCols = (df.columns.toSet - groupCol).map(
  colName => avg(colName).as(colName + "_avg")
).toList
// agg takes one expression plus varargs, hence the head/tail splat
df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)
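If the default output names (like avg(f1)) are acceptable, agg also has an overload that takes a Map from column name to aggregate function name, which avoids the head/tail splat. A minimal sketch, reusing df and groupCol from above:

// Alternative: agg(Map[String, String]) maps each column name to an
// aggregate function name; result columns default to names like "avg(f1)".
val aggExprs = (df.columns.toSet - groupCol).map(c => c -> "avg").toMap
df.groupBy(groupCol).agg(aggExprs)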
Alternatively, if needed, you can also match on the schema and build the aggregations based on the column type:
import org.apache.spark.sql.functions.first
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

val aggCols = df.schema.collect {
  case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")   // average numeric columns
  case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first") // first value per group for strings
}
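One caveat: the pattern match above would also pick up the grouping column if it happens to be an integer or string, so it is worth filtering it out before applying the expressions. A sketch, reusing df and groupCol from the first snippet:

// Drop the grouping column before matching on the schema, then splat
// the resulting expressions into agg as before.
val typedAggCols = df.schema
  .filterNot(_.name == groupCol)
  .collect {
    case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
    case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first")
  }
  .toList

df.groupBy(groupCol).agg(typedAggCols.head, typedAggCols.tail: _*)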