
Spark Java - Merge same column multiple rows

I'm using Spark with Java and I have a DataFrame like this:

+---+------+-----+
|id |color |datas|
+---+------+-----+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+---+------+-----+

I need to transform this DataFrame so that it looks like this:

+---+--------------------+---------------------+
|id |color               |datas                |
+---+--------------------+---------------------+
|1  |[blue, red, orange] |[data1, data2, data3]|
|2  |[black, yellow]     |[data4, data5]       |
|3  |[white]             |[data7, data8]       |
+---+--------------------+---------------------+

I want to merge the values of each column across the rows that share the same 'id' into one array per column.

I'm able to do it through a UserDefinedAggregateFunction, but I can only do it one column at a time and it takes too much time to process.

Thank you

Edit: I'm using Spark 1.6

Lucien asked Apr 18 '26 23:04


1 Answer

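You can group by "id" and then use the collect_list function to aggregate the values of each column. Note that in Spark 1.6 collect_list is an alias for the Hive UDAF, so the DataFrame needs to come from a HiveContext.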

// requires: import static org.apache.spark.sql.functions.*;
dataframe.groupBy("id").agg(collect_list(col("color")).as("color"), collect_list(col("datas")).as("datas"));
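To make the intended result concrete, here is a small plain-Java sketch (no Spark involved) of what grouping by id and collecting each column into a list computes on the sample rows from the question. The class name MergeRowsSketch, the Merged holder and the mergeById method are hypothetical names made up for this illustration, and skipping blank cells is an assumption read off the expected table, not something the Spark call above is guaranteed to do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeRowsSketch {

    // Hypothetical holder for the two collected columns of one id.
    static final class Merged {
        final List<String> color = new ArrayList<>();
        final List<String> datas = new ArrayList<>();
    }

    // Each input row is {id, color, datas}. Null or blank cells are skipped,
    // an assumption taken from the expected table in the question.
    static Map<String, Merged> mergeById(List<String[]> rows) {
        Map<String, Merged> byId = new LinkedHashMap<>();
        for (String[] row : rows) {
            Merged m = byId.computeIfAbsent(row[0], k -> new Merged());
            if (row[1] != null && !row[1].trim().isEmpty()) {
                m.color.add(row[1]);
            }
            if (row[2] != null && !row[2].trim().isEmpty()) {
                m.datas.add(row[2]);
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        // Sample rows copied from the question's first table.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "blue", "data1"});
        rows.add(new String[] {"1", "red", "data2"});
        rows.add(new String[] {"1", "orange", "data3"});
        rows.add(new String[] {"2", "black", "data4"});
        rows.add(new String[] {"2", "", "data5"});
        rows.add(new String[] {"2", "yellow", ""});
        rows.add(new String[] {"3", "white", "data7"});
        rows.add(new String[] {"3", "", "data8"});

        // Print one line per id: the id, then the two collected lists.
        for (Map.Entry<String, Merged> e : mergeById(rows).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().color + " " + e.getValue().datas);
        }
    }
}
```

As far as I know, collect_list in Spark skips null values but keeps empty strings, so if the blank cells in the real data are empty strings rather than nulls, they would probably have to be turned into null before aggregating to match the expected table.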

Hope this helps

koiralo answered Apr 20 '26 12:04