I was wondering if there is some way to specify a custom aggregation function for spark dataframes over multiple columns.
I have a table of the form (name, item, price):

john | tomato | 1.99
john | carrot | 0.45
bill | apple  | 0.99
john | banana | 1.29
bill | taco   | 2.59

I would like to aggregate each person's items and their costs into a list, like this:

john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)
Is this possible with DataFrames? I recently learned about collect_list, but it appears to work on only one column.
Consider using the struct function to group the columns together before collecting them as a list:

import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
  .show(false)
Outputs:
+----+---------------------------------------------+
|name|foods                                        |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]]                  |
+----+---------------------------------------------+
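To make the grouping semantics concrete, here is a minimal plain-Python sketch (no Spark involved; the rows mirror the example data above) of what collect_list(struct("food", "price")) computes per group:

```python
from collections import defaultdict

rows = [
    ("john", "tomato", 1.99),
    ("john", "carrot", 0.45),
    ("bill", "apple", 0.99),
    ("john", "banana", 1.29),
    ("bill", "taco", 2.59),
]

# Group by name, collecting (food, price) pairs into a list per key,
# analogous to groupBy("name").agg(collect_list(struct("food", "price"))).
foods = defaultdict(list)
for name, food, price in rows:
    foods[name].append((food, price))

print(dict(foods))
```

This is only an illustration of the per-group result; in Spark the work is distributed and the struct keeps each food paired with its price inside a single column.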
The easiest way to do this as a DataFrame is to first collect two lists, and then use a UDF to zip the two lists together. Something like:
import org.apache.spark.sql.functions.{col, collect_list, udf}
import sqlContext.implicits._

val zipper = udf[Seq[(String, Double)], Seq[String], Seq[Double]](_.zip(_))

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

val df2 = df.groupBy("name").agg(
  collect_list(col("food")) as "food",
  collect_list(col("price")) as "price"
).withColumn("food", zipper(col("food"), col("price"))).drop("price")

df2.show(false)
// +----+---------------------------------------------+
// |name|food                                         |
// +----+---------------------------------------------+
// |john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
// |bill|[[apple,0.99], [taco,2.59]]                  |
// +----+---------------------------------------------+
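The heart of the zipper UDF is an ordinary pairwise zip of the two collected lists. A plain-Python sketch of that one step, using the lists that would be collected for "john" in the example above:

```python
# The two per-group lists that the collect_list calls would produce,
# assuming both were collected in the same row order.
foods = ["tomato", "carrot", "banana"]
prices = [1.99, 0.45, 1.29]

# Pair them up element by element, as the zipper UDF does with Seq.zip.
zipped = list(zip(foods, prices))
print(zipped)
```

Note that this approach relies on both collect_list aggregations producing their elements in the same order, which is why collecting a struct directly, as in the first answer, is generally the safer route.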