 

How to get the N most recent dates in PySpark

Is there a way to get the most recent 30 days' worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within each (Grouping, Bucket) group. So a table like this:

| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 1      | 8-14-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 1        | 2      | 8-14-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |
| 2        | 1      | 8-14-2020 |

Would turn into this:

| Grouping | Bucket | Date      |
|----------|--------|-----------|
| 1        | 1      | 9-20-2020 |
| 1        | 1      | 9-16-2020 |
| 1        | 2      | 9-20-2020 |
| 1        | 2      | 9-16-2020 |
| 2        | 1      | 9-20-2020 |
| 2        | 1      | 9-16-2020 |

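For reference, a minimal sketch of building this sample DataFrame (the M-d-yyyy string format and the to_date cast are assumptions; casting to a date type makes the ordering chronological rather than lexical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data matching the input table above
df = spark.createDataFrame(
    [(1, 1, "9-20-2020"), (1, 1, "9-16-2020"), (1, 1, "8-14-2020"),
     (1, 2, "9-20-2020"), (1, 2, "9-16-2020"), (1, 2, "8-14-2020"),
     (2, 1, "9-20-2020"), (2, 1, "9-16-2020"), (2, 1, "8-14-2020")],
    ["Grouping", "Bucket", "Date"],
).withColumn("Date", F.to_date("Date", "M-d-yyyy"))  # parse strings into a real date column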
Edit: after reviewing my edited question, I realized that leaving it as originally written was the right choice.

asked Dec 02 '25 by user7298979


1 Answer

Use a window and take the top two ranks within each window:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank rows within each (Grouping, Bucket) group by Date, newest first
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())

df.withColumn("r", F.rank().over(w)) \
    .filter("r <= 2") \
    .drop("r") \
    .show()

Output:

+--------+------+----------+
|Grouping|Bucket|      Date|
+--------+------+----------+
|       1|     1|2020-09-20|
|       1|     1|2020-09-16|
|       1|     2|2020-09-20|
|       1|     2|2020-09-16|
|       2|     1|2020-09-20|
|       2|     1|2020-09-16|
+--------+------+----------+
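Note that rank() keeps ties, so a group can return more than two rows when dates repeat. If you want exactly two rows per group regardless of ties, row_number() is a drop-in alternative (a sketch under the same assumptions as above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())

# row_number assigns 1, 2, 3, ... with no ties, so at most two rows survive per group
df.withColumn("r", F.row_number().over(w)) \
    .filter("r <= 2") \
    .drop("r") \
    .show()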

Edit: this answer applies to this revision (get the most recent N records for each group).
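If the goal is instead the most recent 30 days' worth of records per group, as the question's opening sentence asks, one hedged sketch is to compute each group's latest date and keep rows within 30 days of it (the inclusive 30-day cutoff is an assumption):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Grouping", "Bucket")

# keep rows whose Date falls within 30 days of that group's most recent Date
df.withColumn("max_date", F.max("Date").over(w)) \
    .filter(F.col("Date") >= F.date_sub(F.col("max_date"), 30)) \
    .drop("max_date") \
    .show()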

answered Dec 04 '25 by werner