Window Function Tie breaker on other field to get the Latest Record

Question

I have following data, where the data is partitioned by the stores and month id and ordered by amount in order to get the primary vendor for the store.

I need a tie breaker if the amount is equal between two vendors, then if one of the tied vendor was the previous months most sales vendor, make that vendor as the most sales vendor for the month.

The look back will increase if there is a tie again. Lag of 1 month will not work if there is tie again. Worst case scenario we will have more duplicates in previous month also.

sample data

val data = Seq((201801,      10941,            115,  80890.44900, 135799.66400),
               (201801,      10941,            3,  80890.44900, 135799.66400) ,
               (201712,      10941,            3, 517440.74500, 975893.79000),
               (201712,      10941,            115, 517440.74500, 975893.79000),
               (201711,      10941,            3 , 371501.92100, 574223.52300),
               (201710,      10941,            115, 552435.57800, 746912.06700),
               (201709,      10941,            115,1523492.60700,1871480.06800),
               (201708,      10941,            115,1027698.93600,1236544.50900),
               (201707,      10941,            33 ,1469219.86900,1622949.53000)
               ).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")

Code:

val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)

Output:

    +------+--------+-----+-----------+-----------+----+
 |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801|   10941|  115|  80890.449| 135799.664|   1|
|201801|   10941|    3|  80890.449| 135799.664|   1|
|201712|   10941|    3| 517440.745|  975893.79|   1|
|201712|   10941|  115| 517440.745|  975893.79|   1|
|201711|   10941|  115| 371501.921| 574223.523|   1|
|201710|   10941|  115| 552435.578| 746912.067|   1|
|201709|   10941|  115|1523492.607|1871480.068|   1|
|201708|   10941|  115|1027698.936|1236544.509|   1|
|201707|   10941|   33|1469219.869| 1622949.53|   1|
+------+--------+-----+-----------+-----------+----+

My rank is 1 for both 1 and 2 records, but my rank should be 1 for second record based on previous month max dollars

I am expecting the following output.

    +------+--------+-----+-----------+-----------+----+
    |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
    +------+--------+-----+-----------+-----------+----+
    |201801|   10941|  115|  80890.449| 135799.664|   2|
    |201801|   10941|    3|  80890.449| 135799.664|   1|
    |201712|   10941|    3| 517440.745|  975893.79|   1|
    |201712|   10941|  115| 517440.745|  975893.79|   1|
    |201711|   10941|    3| 371501.921| 574223.523|   1|
    |201710|   10941|  115| 552435.578| 746912.067|   1|
    |201709|   10941|  115|1523492.607|1871480.068|   1|
    |201708|   10941|  115|1027698.936|1236544.509|   1|
    |201707|   10941|   33|1469219.869| 1622949.53|   1|
    +------+--------+-----+-----------+-----------+----+

Should I write a UDAF? Any suggestions would help.

David Schuler · Accepted Answer

You can do this with 2 windows. First, you will need to use the lag() function to carry over the previous month's sales values so that you can use that in your rank window. here's that part in pyspark:

lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))

Then edit your window to include that new column:

window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand|  brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711|   10941|   99| 371501.921| 574223.523|            null|   1|
|201709|   10941|  115|1523492.607|1871480.068|     1027698.936|   1|
|201707|   10941|   33|1469219.869| 1622949.53|            null|   1|
|201708|   10941|  115|1027698.936|1236544.509|            null|   1|
|201710|   10941|  115| 552435.578| 746912.067|     1523492.607|   1|
|201712|   10941|    3| 517440.745|  975893.79|            null|   1|
|201801|   10941|    3|  80890.449| 135799.664|      517440.745|   1|
|201801|   10941|  115|  80890.449| 135799.664|      552435.578|   2|
+------+--------+-----+-----------+-----------+----------------+----+

Window Function Tie breaker on other field to get the Latest Record

Tags:

sql

apache-spark

apache-spark-sql

pyspark

pyspark-sql

loneStar

1 Answers

David Schuler

Recent Activity

Donate For Us

Window Function Tie breaker on other field to get the Latest Record

Tags:

sql

apache-spark

apache-spark-sql

pyspark

pyspark-sql

loneStar

1 Answers

David Schuler

Related questions

Recent Activity

Donate For Us