I have a DataFrame that looks like this:
+-------+---------+
|email |timestamp|
+-------+---------+
|[email protected]| 1|
|[email protected]| 2|
|[email protected]| 3|
|[email protected]| 4|
|[email protected]| 5|
| .. | ..|
+-------+---------+
for each email i want to keep the latest record, so the result would be:
+-------+---------+
|email |timestamp|
+-------+---------+
|[email protected]| 4|
|[email protected]| 5|
|[email protected]| 3|
| .. | ..|
+-------+---------+
How can I do that? I'm new to Spark and DataFrames.
Here is a general ANSI SQL query that should work with Spark SQL:
SELECT email, timestamp
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY timestamp DESC) AS rn
    FROM yourTable t
) t
WHERE rn = 1;
For PySpark DataFrame code, try the following (note the missing `functions` import in many snippets, and the parentheses needed to continue chained calls across lines):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each email, newest timestamp first
w = Window.partitionBy("email").orderBy(F.col("timestamp").desc())
df = yourDF.withColumn("rn", F.row_number().over(w))

# Keep only the newest row per email, then drop the helper column
df = df.filter(F.col("rn") == 1).drop("rn")
df.show()