Custom aggregations for Spark dataframes

Question

I was wondering if there is some way to specify a custom aggregation function for Spark dataframes. If I have a table with 2 columns id and value I would like to groupBy id and aggregate the values into a list for each value like so:

from:

john | tomato
john | carrot
bill | apple
john | banana
bill | taco

to:

john | tomato, carrot, banana
bill | apple, taco

Is this possible in dataframes? I am asking about dataframes because I am reading data as an orc file and it is loaded as a dataframe. I would think it is in-efficient to convert it to a RDD.

eliasah · Accepted Answer

I'd just go simply with the following :

import org.apache.spark.sql.functions.collect_list
val df = Seq(("john", "tomato"), ("john", "carrot"), 
             ("bill", "apple"), ("john", "banana"), 
             ("bill", "taco")).toDF("id", "value")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]

val aggDf = df.groupBy($"id").agg(collect_list($"value").as("values"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, values: array<string>]

aggDf.show(false)
// +----+------------------------+
// |id  |values                  |
// +----+------------------------+
// |john|[tomato, carrot, banana]|
// |bill|[apple, taco]           |
// +----+------------------------+

You won't even need to call the underlying rdd.

Custom aggregations for Spark dataframes

Tags:

group-by

aggregate-functions

scala

apache-spark

apache-spark-sql

anthonybell

1 Answers

eliasah

Recent Activity

Donate For Us

Custom aggregations for Spark dataframes

Tags:

group-by

aggregate-functions

scala

apache-spark

apache-spark-sql

anthonybell

1 Answers

eliasah

Related questions

Recent Activity

Donate For Us