Spark Dataframe - How to keep only latest record for each group based on ID and Date? [duplicate]

I have a dataframe:

DF:

1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25

How to keep only latest record for each group? (there are 3 groups above (1,2,3)).

Result should be:

1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25

I am also trying to make this efficient (e.g. it should finish within a few minutes on a moderate cluster with 100 million records), so any sorting/ordering should be done (if required) in the most efficient and correct manner.

asked Jan 25 '23 by Joe
1 Answer

You have to use a window function.

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window

Partition the window by the group column and order it by time descending; the PySpark script below does the work:

from pyspark.sql.functions import dense_rank, desc
from pyspark.sql.window import Window

# Group (int) and event time (timestamp)
schema = "Group int, time timestamp"

df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')

# Rank rows within each Group, newest time first
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank', dense_rank().over(w))

# Keep only the newest row per group, then drop the helper column
df.filter(df.Rank == 1).drop('Rank').show()


+-----+-------------------+
|Group|               time|
+-----+-------------------+
|    1|2016-11-18 14:47:05|
|    3|2016-10-12 17:24:25|
|    2|2016-10-12 22:24:25|
+-----+-------------------+
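As a minimal alternative sketch, assuming the DataFrame has only the group key and the timestamp (as in the sample data, and before the Rank column is added above), a plain aggregation avoids the window entirely and lets Spark do partial (map-side) aggregation, which tends to scale well on ~100 million rows:

from pyspark.sql.functions import max as max_

# df here is the two-column DataFrame as originally loaded (Group, time)
latest = df.groupBy('Group').agg(max_('time').alias('time'))
latest.show()

If the real data carries additional columns that must be kept with the latest row, stick with the window approach; in that case row_number() instead of dense_rank() guarantees exactly one row per group even when two rows share the same maximum timestamp.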





answered Feb 03 '23 by Ravi