I have a spark dataset like this one:
key id val1 val2 val3
1   a  a1   a2   a3
2   a  a4   a5   a6
3   b  b1   b2   b3
4   b  b4   b5   b6
5   b  b7   b8   b9
6   c  c1   c2   c3
I would like to group all rows by id in a list or array like this:
(a, ([1 a a1 a2 a3], [2 a a4 a5 a6]) ),
(b, ([3 b b1 b2 b3], [4 b b4 b5 b6], [5 b b7 b8 b9]) ),
(c, ([6 c c1 c2 c3]) )
I have used map to output key/value pairs with the right key, but I'm having trouble building the final key/array. Can anybody help with that?
In Spark, groupByKey is a frequently used transformation that shuffles data. It takes key-value pairs (K, V) as input, groups the values by key, and produces a dataset of (K, Iterable<V>) pairs as output.
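The (K, Iterable) shape that groupByKey produces can be illustrated with plain Scala collections. This is only a sketch of the semantics — the real Spark version shuffles data across partitions — but the grouping behaviour is the same:

```scala
// Plain-Scala sketch of groupByKey semantics: (K, V) pairs in,
// (K, Iterable[V]) pairs out. No Spark required to see the shape.
object GroupByKeySketch {
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val grouped = groupByKey(Seq(("a", 1), ("a", 2), ("b", 3)))
    println(grouped) // values for "a" are List(1, 2), for "b" List(3)
  }
}
```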
Similar to the SQL GROUP BY clause, Spark SQL's groupBy() function collects identical data into groups on a DataFrame/Dataset so that aggregate functions such as count(), min(), max(), avg(), and mean() can be applied to the grouped data.
RelationalGroupedDataset is an interface for calculating aggregates over groups of rows in a DataFrame; it is the result of executing grouping operators such as groupBy. Note that KeyValueGroupedDataset, by contrast, is used for typed aggregates over groups of custom Scala objects (not Rows).
How about this:

import org.apache.spark.sql.functions._

df.withColumn("combined", array("key", "id", "val1", "val2", "val3"))
  .groupBy("id")
  .agg(collect_list("combined"))
The array function combines the columns into a single array column, and then it's a simple groupBy with collect_list.
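To see what this transformation produces, here is a plain-Scala sketch of the same pattern using an illustrative Row case class (a hypothetical stand-in for the DataFrame rows; the real Spark version above does this in a distributed way):

```scala
// Sketch of the array + groupBy + collect_list pattern on plain
// Scala collections: each row becomes an array of its values,
// then the arrays are grouped by id into one list per id.
object CollectListSketch {
  // Hypothetical row type mirroring the question's columns.
  case class Row(key: Int, id: String, val1: String, val2: String, val3: String)

  def collectById(rows: Seq[Row]): Map[String, Seq[Seq[String]]] =
    rows
      .map(r => (r.id, Seq(r.key.toString, r.id, r.val1, r.val2, r.val3))) // withColumn("combined", array(...))
      .groupBy(_._1)                                                       // groupBy("id")
      .map { case (id, vs) => (id, vs.map(_._2)) }                         // agg(collect_list("combined"))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(1, "a", "a1", "a2", "a3"),
      Row(2, "a", "a4", "a5", "a6"),
      Row(6, "c", "c1", "c2", "c3")
    )
    println(collectById(rows))
  }
}
```

One caveat for the real DataFrame version: after a shuffle, the order of elements collected by collect_list is not guaranteed, so sort afterwards if row order within each group matters.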