Merge rows in a Spark DataFrame
I have data like the following:
ID  Name     Passport  Country  License  UpdatedtimeStamp
1   Ostrich  12345     -        ABC      11-02-2018
1   -        -         -        BCD      10-02-2018
1   Shah     12345     -        -        12-02-2018
2   PJ       -         ANB      a        10-02-2018
The required output is:
ID  Name     Passport  Country  License  UpdatedtimeStamp
1   Shah     12345     -        ABC      12-02-2018
2   PJ       -         ANB      a        10-02-2018
Basically, rows with the same ID should be merged: for each column, the most recently updated non-null value should appear in the output. If all values of a column are null for an ID, then null should be retained.
Please suggest a solution. Also, please suggest one that does not use Spark SQL Window functions, as I need it to be very fast.
If you want to stay completely in Spark SQL, you can do the following (note that it does use a Window function):
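import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
val df = Seq((1, Some("ostrich"), Some(12345), None, Some("ABC"), "11-02-2018"),
  (1, None, None, None, Some("BCD"), "10-02-2018"),
  (1, Some("Shah"), Some(12345), None, None, "12-02-2018"),
  (2, Some("PJ"), None, Some("ANB"), Some("a"), "10-02-2018")).toDF("ID", "Name", "Passport", "Country", "License", "UpdatedtimeStamp")
// Parse the timestamp string into a date so rows can be ordered by recency
val df1 = df.withColumn("date", to_date($"UpdatedtimeStamp", "MM-dd-yyyy")).drop($"UpdatedtimeStamp")
// Rank rows within each ID, newest first, and sort by that rank
val win = Window.partitionBy("ID").orderBy($"date".desc)
val df2 = df1.select($"*", row_number().over(win).as("r")).orderBy($"ID", $"r").drop("r")
// collect_list skips nulls, so each list holds only the non-null values of a column
// Note: this relies on groupBy preserving the sort order, which Spark does not strictly guarantee
val exprs = df2.columns.drop(1).map(x => collect_list(x).as(x + "_grp"))
val df3 = df2.groupBy("ID").agg(exprs.head, exprs.tail: _*)
// Take the first element of each list; an empty list yields null when ANSI mode is off
val exprs2 = df3.columns.drop(1).map(x => col(x)(0).as(x.stripSuffix("_grp")))
df3.select((Array(col(df2.columns(0))) ++ exprs2): _*).show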
+---+----+--------+-------+-------+----------+
| ID|Name|Passport|Country|License|      date|
+---+----+--------+-------+-------+----------+
|  1|Shah|   12345|   null|    ABC|2018-12-02|
|  2|  PJ|    null|    ANB|      a|2018-10-02|
+---+----+--------+-------+-------+----------+
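Since the question asks for an approach without Window functions, here is a minimal sketch of one possible alternative, not a benchmarked or definitive solution. It uses a single groupBy and, for each column, takes the max of a (date, value) struct over the non-null values, so the value with the latest date wins. The names MergeRowsSketch and mergeLatestNonNull are hypothetical, and the "MM-dd-yyyy" pattern is simply carried over from the answer above. If two rows of one ID share the same date, the larger value is picked, which may not be what you want.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct, to_date, when}

// Hypothetical names: MergeRowsSketch and mergeLatestNonNull are illustrative only.
object MergeRowsSketch {

  // For each value column, keep the non-null value that carries the greatest orderCol.
  def mergeLatestNonNull(df: DataFrame, keyCol: String, orderCol: String,
                         valueCols: Seq[String]): DataFrame = {
    val latest = valueCols.map { c =>
      // Structs compare field by field, so max() picks the struct with the newest orderCol.
      // when() without otherwise() yields null for null values, and max() ignores nulls.
      max(when(col(c).isNotNull, struct(col(orderCol), col(c)))).getField(c).as(c)
    }
    df.groupBy(keyCol)
      .agg(max(col(orderCol)).as(orderCol), latest: _*)
      .select(keyCol, (valueCols :+ orderCol): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("merge-rows-sketch").getOrCreate()
    import spark.implicits._

    // Explicit element type so the None values are typed as Option[String] / Option[Int].
    val df = Seq[(Int, Option[String], Option[Int], Option[String], Option[String], String)](
      (1, Some("Ostrich"), Some(12345), None, Some("ABC"), "11-02-2018"),
      (1, None, None, None, Some("BCD"), "10-02-2018"),
      (1, Some("Shah"), Some(12345), None, None, "12-02-2018"),
      (2, Some("PJ"), None, Some("ANB"), Some("a"), "10-02-2018")
    ).toDF("ID", "Name", "Passport", "Country", "License", "UpdatedtimeStamp")

    // Assumption: same date pattern as in the answer above.
    val withDate = df
      .withColumn("date", to_date($"UpdatedtimeStamp", "MM-dd-yyyy"))
      .drop("UpdatedtimeStamp")

    mergeLatestNonNull(withDate, "ID", "date", Seq("Name", "Passport", "Country", "License"))
      .orderBy("ID")
      .show()

    spark.stop()
  }
}
```

On the sample data this should give the same two rows as the table above. Whether it is actually faster than the Window version depends on the data and cluster, so it is worth measuring both on a realistic sample.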