I have a use case where I need to drop duplicate rows of a DataFrame (duplicate here meaning rows with the same 'id' field) while keeping the row with the highest 'timestamp' (a Unix timestamp) field.
I found the dropDuplicates method (I'm using PySpark), but it gives you no control over which row is kept.
Can anyone help? Thanks in advance.
A manual map and reduce might be needed to provide the functionality you want.
def selectRowByTimeStamp(x, y):
    # Keep whichever row carries the larger timestamp
    if x.timestamp > y.timestamp:
        return x
    return y

dataMap = data.rdd.map(lambda x: (x.id, x))  # key each row by its id (.rdd needed on a DataFrame in Spark 2+)
uniqueData = dataMap.reduceByKey(selectRowByTimeStamp)
Here we key every record by its id. Then, as each group is reduced, we keep the record with the higher timestamp, so when the reduce finishes exactly one record remains per id.
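If you need a DataFrame again afterwards, a minimal sketch (assuming a SparkSession named spark) would be:

# uniqueData is an RDD of (id, Row) pairs; keep just the Rows
# and rebuild a DataFrame from them (spark is an assumed SparkSession)
resultDF = spark.createDataFrame(uniqueData.values())
resultDF.show()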
You can do something like this:
val df = Seq(
  (1, 12345678, "this is a test"),
  (1, 23456789, "another test"),
  (2, 2345678, "2nd test"),
  (2, 1234567, "2nd another test")
).toDF("id", "timestamp", "data")
+---+---------+----------------+
| id|timestamp| data|
+---+---------+----------------+
| 1| 12345678| this is a test|
| 1| 23456789| another test|
| 2| 2345678| 2nd test|
| 2| 1234567|2nd another test|
+---+---------+----------------+
df.join(
  df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
  $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
).drop("r_id").drop("r_timestamp").show
+---+---------+------------+
| id|timestamp| data|
+---+---------+------------+
| 1| 23456789|another test|
| 2| 2345678| 2nd test|
+---+---------+------------+
If an id could repeat its maximum timestamp, the join above would return that id more than once; in that case you can deduplicate on (id, timestamp) first:
df.dropDuplicates(Seq("id", "timestamp")).join(
  df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
  $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
).drop("r_id").drop("r_timestamp").show
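Since the question is about PySpark, here is a sketch of the same join translated to Python; it assumes a DataFrame df with the id/timestamp/data columns as above:

from pyspark.sql import functions as F

# Max timestamp per id, with columns renamed to avoid join ambiguity
latest = (df.groupBy("id")
    .agg(F.max("timestamp").alias("r_timestamp"))
    .withColumnRenamed("id", "r_id"))

(df.dropDuplicates(["id", "timestamp"])
    .join(latest, (F.col("id") == F.col("r_id")) &
                  (F.col("timestamp") == F.col("r_timestamp")))
    .drop("r_id", "r_timestamp")
    .show())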