Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark dataframe: Pivot and Group based on columns

I have input dataframe as below with id, app, and customer

Input dataframe

+--------------------+-----+---------+
|                  id|app  |customer |
+--------------------+-----+---------+
|id1                 |   fw|     WM  |
|id1                 |   fw|     CS  |
|id2                 |   fw|     CS  |
|id1                 |   fe|     WM  |
|id3                 |   bc|     TR  |
|id3                 |   bc|     WM  |
+--------------------+-----+---------+

Expected output

Using pivot and aggregate - make app values as column name and put aggregated customer names as list in the dataframe

Expected dataframe

+--------------------+----------+-------+----------+
|                  id|   bc     |     fe|    fw    |
+--------------------+----------+-------+----------+
|id1                 |  0       |     WM|   [WM,CS]|
|id2                 |  0       |      0|   [CS]   |
|id3                 | [TR,WM]  |      0|      0   |
+--------------------+----------+-------+----------+

What have i tried ?

val newDF = df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()

+--------------------+-----+-------+------+
|                  id|bc   |     fe|    fw|
+--------------------+-----+-------+------+
|id1                 |  0  |     WM|    WM|
|id2                 |  0  |      0|    CS|
|id3                 | TR  |      0|     0|
+--------------------+-----+-------+------+

Issue : In my query , i am not able to get the list of customer like [WM,CS] for "id1" under "fw" (as shown in expected output) , only "WM" is coming. Similarly, for "id3" only "TR" is appearing - instead a list should appear with value [TR,WM] under "bc" for "id3"

Need your suggestion to get the list of customer under each app respectively.

like image 366
Debaditya Avatar asked Jan 29 '23 14:01

Debaditya


1 Answers

You can use collect_list if you can bear with an empty List at cells where it should be zero:

df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id|      bc|  fe|      fw|
+---+--------+----+--------+
|id3|[TR, WM]|  []|      []|
|id1|      []|[WM]|[CS, WM]|
|id2|      []|  []|    [CS]|
+---+--------+----+--------+
like image 146
Psidom Avatar answered Feb 05 '23 01:02

Psidom