
select latest record from spark dataframe

I have a DataFrame that looks like this:

+-----------------+---------+
|email            |timestamp|
+-----------------+---------+
|[email protected]|        1|
|[email protected]|        2|
|[email protected]|        3|
|[email protected]|        4|
|[email protected]|        5|
|..               |       ..|
+-----------------+---------+

For each email, I want to keep only the latest record, so the result would be:

+-----------------+---------+
|email            |timestamp|
+-----------------+---------+
|[email protected]|        4|
|[email protected]|        5|
|[email protected]|        3|
|..               |       ..|
+-----------------+---------+

How can I do that? I'm new to Spark and DataFrames.

user468587 asked Apr 10 '19 14:04



1 Answer

Here is a general ANSI SQL query which should work with Spark SQL:

SELECT email, timestamp
FROM
(
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY email ORDER BY timestamp DESC) rn
    FROM yourTable t
) t
WHERE rn = 1;
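
If you want to check the logic of that query without a Spark session, here is a minimal sketch that runs it against an in-memory SQLite database from Python's standard library. This is an assumption-laden stand-in, not Spark itself: it relies on SQLite 3.25 or newer (for window function support), the email addresses are hypothetical placeholders since the real ones are not shown in the question, and the helper name latest_per_email is made up for illustration. An ORDER BY email is added only to make the output order predictable.

```python
import sqlite3

# Hypothetical sample rows; the addresses are placeholders, not the asker's data.
rows = [
    ("a@example.com", 1),
    ("b@example.com", 2),
    ("c@example.com", 3),
    ("a@example.com", 4),
    ("b@example.com", 5),
]

# Same query as above, plus ORDER BY email for a deterministic result order.
QUERY = """
SELECT email, timestamp
FROM (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY email ORDER BY timestamp DESC) AS rn
    FROM yourTable t
) t
WHERE rn = 1
ORDER BY email
"""


def latest_per_email(rows):
    """Load rows into an in-memory table and return the newest row per email."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE yourTable (email TEXT, timestamp INTEGER)")
        conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


print(latest_per_email(rows))
```

Within each email partition, the row with the highest timestamp gets rn = 1, so filtering on rn = 1 keeps exactly one row per email. Note that if two rows for the same email share the same timestamp, ROW_NUMBER() still keeps only one of them, and which one is not defined.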

For PySpark data frame code, try the following:

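from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank each email's rows from newest to oldest
w = Window.partitionBy("email").orderBy(F.col("timestamp").desc())

# Keep only the newest row per email, then drop the helper column
df = (
    yourDF
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
df.show()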
Tim Biegeleisen answered Jan 03 '23 05:01
