
How do I groupBy and concat a list in a DataFrame in Spark Scala?

I have a DataFrame with two columns, with data as below:

+----+-----------------+
|acct|           device|
+----+-----------------+
|   B|       List(3, 4)|
|   C|       List(3, 5)|
|   A|       List(2, 6)|
|   B|List(3, 11, 4, 9)|
|   C|       List(5, 6)|
|   A|List(2, 10, 7, 6)|
+----+-----------------+

And I need the result as below:

+----+-----------------+
|acct|           device|
+----+-----------------+
|   B|List(3, 4, 11, 9)|
|   C|    List(3, 5, 6)|
|   A|List(2, 6, 7, 10)|
+----+-----------------+

I tried the following, but it doesn't seem to work:

df.groupBy("acct").agg(concat("device"))

df.groupBy("acct").agg(collect_set("device"))

How can I achieve this using Scala?

asked May 08 '18 by Babu

People also ask

How do I concatenate a Spark in a DataFrame?

Spark SQL provides the concat() function to concatenate two or more DataFrame columns into a single column. It can also take columns of different data types and concatenate them into one column; for example, it supports String, Int, Boolean, and array columns.
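
A minimal sketch of concat() on string columns, assuming a SparkSession named spark is in scope; the columns first and last and the sample rows are hypothetical:

import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

val people = Seq(("John", "Doe"), ("Jane", "Roe")).toDF("first", "last")
// Concatenate the two columns with a space in between.
people.select(concat($"first", lit(" "), $"last").as("full_name")).show()
// +---------+
// |full_name|
// +---------+
// | John Doe|
// | Jane Roe|
// +---------+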

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a DataFrame that includes the columns "name" and "age", and you want to group by these two columns. To keep the other columns after a groupBy, you can join the aggregated result back to the original DataFrame; the joined data will then have all columns, including the aggregate values.
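
A sketch of that join-back pattern; the name and age columns follow the description above, while the city column and the values are made up for illustration:

import spark.implicits._ // assumes a SparkSession named `spark`

val people = Seq(("Alice", 30, "NY"), ("Bob", 25, "LA"), ("Alice", 30, "SF"))
  .toDF("name", "age", "city")

// Aggregate, then join the counts back to keep the other columns.
val counts = people.groupBy("name", "age").count()
val dataJoined = people.join(counts, Seq("name", "age"))
dataJoined.show()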

How do you do a group by in Spark?

Similar to the SQL GROUP BY clause, Spark SQL's groupBy() function collects identical data into groups on a DataFrame/Dataset so that aggregate functions like count(), min(), max(), avg(), and mean() can be run on the grouped data.
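
A small illustration under the same assumptions (a SparkSession named spark; the dept/amount data is hypothetical):

import org.apache.spark.sql.functions.{count, min, max, avg}
import spark.implicits._

val sales = Seq(("A", 10), ("A", 20), ("B", 5)).toDF("dept", "amount")
// Run several aggregates over each group in one pass.
sales.groupBy("dept")
  .agg(count("amount"), min("amount"), max("amount"), avg("amount"))
  .show()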

How do I join multiple columns in Spark Scala?

Using join syntax: this join syntax takes the right Dataset, joinExprs, and joinType as arguments, and joinExprs provides the join condition on multiple columns. This example joins the emptDF DataFrame with the deptDF DataFrame on the dept_id and branch_id columns using an inner join.
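
A sketch of that multi-column join; emptDF and deptDF are named as in the snippet, but their schemas here are invented for illustration:

import spark.implicits._

val emptDF = Seq((1, 100, 10, "Alice"), (2, 200, 20, "Bob"))
  .toDF("emp_id", "dept_id", "branch_id", "name")
val deptDF = Seq((100, 10, "Sales"), (200, 20, "IT"))
  .toDF("dept_id", "branch_id", "dept_name")

// Inner join on both dept_id and branch_id.
val joined = emptDF.join(deptDF,
  emptDF("dept_id") === deptDF("dept_id") &&
    emptDF("branch_id") === deptDF("branch_id"),
  "inner")
joined.show()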


1 Answer

You can start by exploding the device column and continue as you did, but note that this might not preserve the order of the lists (which isn't guaranteed by any group-by anyway):

import org.apache.spark.sql.functions.{explode, collect_set}
import spark.implicits._ // for $"..."; assumes a SparkSession named `spark`

val result = df.withColumn("device", explode($"device"))
  .groupBy("acct")
  .agg(collect_set("device"))

result.show(truncate = false)
// +----+-------------------+
// |acct|collect_set(device)|
// +----+-------------------+
// |B   |[9, 3, 4, 11]      |
// |C   |[5, 6, 3]          |
// |A   |[2, 6, 10, 7]      |
// +----+-------------------+
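
If you are on Spark 2.4 or later, a possible alternative (a sketch, not part of the original answer) avoids the explode by collecting the lists and flattening them. flatten and array_distinct were added in 2.4; array_distinct keeps the first occurrence of each value, though collect_list's ordering is still not guaranteed after a shuffle:

import org.apache.spark.sql.functions.{flatten, array_distinct, collect_list}

// Collect each group's lists, flatten them into one array, then dedupe.
val result2 = df.groupBy("acct")
  .agg(array_distinct(flatten(collect_list($"device"))).as("device"))

result2.show(truncate = false)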
answered Sep 22 '22 by Tzach Zohar