I have a spark dataset like this one:
key id val1 val2 val3
1   a  a1   a2   a3
2   a  a4   a5   a6
3   b  b1   b2   b3
4   b  b4   b5   b6
5   b  b7   b8   b9
6   c  c1   c2   c3
I would like to group all rows by id in a list or array like this:
(a, ([1 a a1 a2 a3], [2 a a4 a5 a6]) ),
(b, ([3 b b1 b2 b3], [4 b b4 b5 b6], [5 b b7 b8 b9]) ),
(c, ([6 c c1 c2 c3]) )
I have used map to output key/value pairs with the right key, but I'm having trouble building the final key/array. Can anybody help with that?
In Spark, groupByKey is a frequently used transformation that shuffles data. It takes key-value pairs (K, V) as input, groups the values by key, and produces a dataset of (K, Iterable<V>) pairs as output.
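The (K, Iterable) shape that groupByKey produces can be illustrated with plain Scala collections. This is only a sketch of the semantics — the real Spark version shuffles data across partitions — but the grouping behaviour is the same:

```scala
// Plain-Scala sketch of groupByKey semantics: (K, V) pairs in,
// (K, Iterable[V]) pairs out. No Spark required to see the shape.
object GroupByKeySketch {
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val grouped = groupByKey(Seq(("a", 1), ("a", 2), ("b", 3)))
    println(grouped) // values for "a" are List(1, 2), for "b" List(3)
  }
}
```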
Similar to the SQL GROUP BY clause, Spark SQL's groupBy() function collects identical data into groups on a DataFrame/Dataset so that aggregate functions such as count(), min(), max(), avg(), and mean() can be applied to the grouped data.
RelationalGroupedDataset is an interface for calculating aggregates over groups of rows in a DataFrame; it is the result of executing grouping operators such as groupBy. Note that KeyValueGroupedDataset, by contrast, is used for typed aggregates over groups of custom Scala objects (not Rows).
How about this:

import org.apache.spark.sql.functions._

df.withColumn("combined", array("key", "id", "val1", "val2", "val3"))
  .groupBy("id")
  .agg(collect_list("combined"))
The array function combines the columns into a single array column, and then it's a simple groupBy with collect_list.
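To see what this transformation produces, here is a plain-Scala sketch of the same pattern using an illustrative Row case class (a hypothetical stand-in for the DataFrame rows; the real Spark version above does this in a distributed way):

```scala
// Sketch of the array + groupBy + collect_list pattern on plain
// Scala collections: each row becomes an array of its values,
// then the arrays are grouped by id into one list per id.
object CollectListSketch {
  // Hypothetical row type mirroring the question's columns.
  case class Row(key: Int, id: String, val1: String, val2: String, val3: String)

  def collectById(rows: Seq[Row]): Map[String, Seq[Seq[String]]] =
    rows
      .map(r => (r.id, Seq(r.key.toString, r.id, r.val1, r.val2, r.val3))) // withColumn("combined", array(...))
      .groupBy(_._1)                                                       // groupBy("id")
      .map { case (id, vs) => (id, vs.map(_._2)) }                         // agg(collect_list("combined"))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(1, "a", "a1", "a2", "a3"),
      Row(2, "a", "a4", "a5", "a6"),
      Row(6, "c", "c1", "c2", "c3")
    )
    println(collectById(rows))
  }
}
```

One caveat for the real DataFrame version: after a shuffle, the order of elements collected by collect_list is not guaranteed, so sort afterwards if row order within each group matters.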