Merge rows in a Spark Scala DataFrame

Question

I have data like the following ("-" marks a null value):

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   Ostrich 12345       -       ABC         11-02-2018
1   -       -           -       BCD         10-02-2018
1   Shah    12345       -       -           12-02-2018
2   PJ      -           ANB     a           10-02-2018

The required output is:

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   Shah    12345       -       ABC         12-02-2018
2   PJ      -           ANB     a           10-02-2018

Basically, rows with the same ID should be merged: for each column, the output should hold the most recently updated non-null value. If all values in a column are null for that ID, null should be retained.

Please suggest a way to do this, preferably without Spark SQL Window functions, as I need it to be very fast.

mikeL · Accepted Answer

If you want to stay completely in Spark SQL:

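import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Sample data; None stands for the "-" (null) cells in the question
val df = Seq(
  (1, Some("ostrich"), Some(12345), None, Some("ABC"), "11-02-2018"),
  (1, None, None, None, Some("BCD"), "10-02-2018"),
  (1, Some("Shah"), Some(12345), None, None, "12-02-2018"),
  (2, Some("PJ"), None, Some("ANB"), Some("a"), "10-02-2018")
).toDF("ID", "Name", "Passport", "Country", "License", "UpdatedtimeStamp")

// Parse the timestamp string into a date so that it sorts chronologically
val df1 = df.withColumn("date", to_date($"UpdatedtimeStamp", "MM-dd-yyyy")).drop("UpdatedtimeStamp")

// Order the rows newest-first within each ID
val win = Window.partitionBy("ID").orderBy($"date".desc)
val df2 = df1.select($"*", row_number().over(win).as("r")).orderBy($"ID", $"r").drop("r")

// collect_list skips nulls, so each list holds only the non-null values of a column.
// Caveat: this relies on collect_list keeping the sorted order, which Spark does not guarantee.
val exprs = df2.columns.drop(1).map(x => collect_list(x).as(x + "_grp"))
val df3 = df2.groupBy("ID").agg(exprs.head, exprs.tail: _*)

// Take the first (newest) element of each list and restore the original column name
val exprs2 = df3.columns.drop(1).map(x => col(x)(0).as(x.stripSuffix("_grp")))

df3.select((Array(col(df2.columns(0))) ++ exprs2): _*).show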


+---+----+--------+-------+-------+----------+
| ID|Name|Passport|Country|License|      date|
+---+----+--------+-------+-------+----------+
|  1|Shah|   12345|   null|    ABC|2018-12-02|
|  2|  PJ|    null|    ANB|      a|2018-10-02|
+---+----+--------+-------+-------+----------+
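
Since the question asks to avoid Window functions, here is a minimal sketch of an alternative that needs only a single groupBy. It is not part of the answer above: the object name MergeLatestNonNull and the helper mergeLatest are hypothetical, and the "MM-dd-yyyy" pattern is carried over from the answer as an assumption about the data. For each column it takes max over struct(date, value), restricted to non-null values, so the value with the newest date wins without depending on the row order inside collect_list.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct, to_date, when}

object MergeLatestNonNull {

  // Hypothetical helper: for every column other than idCol and tsCol, keep the
  // non-null value that carries the greatest tsCol within each idCol group.
  def mergeLatest(df: DataFrame, idCol: String, tsCol: String): DataFrame = {
    val valueCols = df.columns.filterNot(c => c == idCol || c == tsCol)

    val latestValues = valueCols.map { c =>
      // when(...) without otherwise yields null for null values, and max skips nulls.
      // Structs compare field by field, so the newest timestamp wins; ties on the
      // timestamp fall back to comparing the value itself.
      val tagged = when(col(c).isNotNull, struct(col(tsCol).as("ts"), col(c).as("v")))
      max(tagged).getField("v").as(c)
    }

    // Also keep the newest timestamp of the group
    val aggs = latestValues :+ max(col(tsCol)).as(tsCol)
    df.groupBy(idCol).agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-rows").getOrCreate()
    import spark.implicits._

    // Same sample rows as in the question; None stands for "-"
    val raw = Seq[(Int, Option[String], Option[Int], Option[String], Option[String], String)](
      (1, Some("Ostrich"), Some(12345), None, Some("ABC"), "11-02-2018"),
      (1, None, None, None, Some("BCD"), "10-02-2018"),
      (1, Some("Shah"), Some(12345), None, None, "12-02-2018"),
      (2, Some("PJ"), None, Some("ANB"), Some("a"), "10-02-2018")
    ).toDF("ID", "Name", "Passport", "Country", "License", "UpdatedtimeStamp")

    // Assumption: same date pattern as in the answer above
    val withDate = raw
      .withColumn("date", to_date($"UpdatedtimeStamp", "MM-dd-yyyy"))
      .drop("UpdatedtimeStamp")

    mergeLatest(withDate, "ID", "date").orderBy("ID").show()
    spark.stop()
  }
}
```

Whether this is actually faster than the Window version depends on the data and the cluster, so it is worth benchmarking both on a realistic sample.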

Tags: dataframe, scala, apache-spark

Darshan Shah
