Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Window Function Tie breaker on other field to get the Latest Record

I have following data, where the data is partitioned by the stores and month id and ordered by amount in order to get the primary vendor for the store.

I need a tie breaker if the amount is equal between two vendors, then if one of the tied vendor was the previous months most sales vendor, make that vendor as the most sales vendor for the month.

The look back will increase if there is a tie again. Lag of 1 month will not work if there is tie again. Worst case scenario we will have more duplicates in previous month also.

sample data

val data = Seq((201801,      10941,            115,  80890.44900, 135799.66400),
               (201801,      10941,            3,  80890.44900, 135799.66400) ,
               (201712,      10941,            3, 517440.74500, 975893.79000),
               (201712,      10941,            115, 517440.74500, 975893.79000),
               (201711,      10941,            3 , 371501.92100, 574223.52300),
               (201710,      10941,            115, 552435.57800, 746912.06700),
               (201709,      10941,            115,1523492.60700,1871480.06800),
               (201708,      10941,            115,1027698.93600,1236544.50900),
               (201707,      10941,            33 ,1469219.86900,1622949.53000)
               ).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")

Code:

val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)

Output:

    +------+--------+-----+-----------+-----------+----+
 |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801|   10941|  115|  80890.449| 135799.664|   1|
|201801|   10941|    3|  80890.449| 135799.664|   1|
|201712|   10941|    3| 517440.745|  975893.79|   1|
|201712|   10941|  115| 517440.745|  975893.79|   1|
|201711|   10941|  115| 371501.921| 574223.523|   1|
|201710|   10941|  115| 552435.578| 746912.067|   1|
|201709|   10941|  115|1523492.607|1871480.068|   1|
|201708|   10941|  115|1027698.936|1236544.509|   1|
|201707|   10941|   33|1469219.869| 1622949.53|   1|
+------+--------+-----+-----------+-----------+----+

My rank is 1 for both 1 and 2 records, but my rank should be 1 for second record based on previous month max dollars

I am expecting the following output.

    +------+--------+-----+-----------+-----------+----+
    |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
    +------+--------+-----+-----------+-----------+----+
    |201801|   10941|  115|  80890.449| 135799.664|   2|
    |201801|   10941|    3|  80890.449| 135799.664|   1|
    |201712|   10941|    3| 517440.745|  975893.79|   1|
    |201712|   10941|  115| 517440.745|  975893.79|   1|
    |201711|   10941|    3| 371501.921| 574223.523|   1|
    |201710|   10941|  115| 552435.578| 746912.067|   1|
    |201709|   10941|  115|1523492.607|1871480.068|   1|
    |201708|   10941|  115|1027698.936|1236544.509|   1|
    |201707|   10941|   33|1469219.869| 1622949.53|   1|
    +------+--------+-----+-----------+-----------+----+

Should I write a UDAF? Any suggestions would help.

like image 909
loneStar Avatar asked Oct 15 '22 14:10

loneStar


1 Answers

You can do this with 2 windows. First, you will need to use the lag() function to carry over the previous month's sales values so that you can use that in your rank window. here's that part in pyspark:

lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))

Then edit your window to include that new column:

window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand|  brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711|   10941|   99| 371501.921| 574223.523|            null|   1|
|201709|   10941|  115|1523492.607|1871480.068|     1027698.936|   1|
|201707|   10941|   33|1469219.869| 1622949.53|            null|   1|
|201708|   10941|  115|1027698.936|1236544.509|            null|   1|
|201710|   10941|  115| 552435.578| 746912.067|     1523492.607|   1|
|201712|   10941|    3| 517440.745|  975893.79|            null|   1|
|201801|   10941|    3|  80890.449| 135799.664|      517440.745|   1|
|201801|   10941|  115|  80890.449| 135799.664|      552435.578|   2|
+------+--------+-----+-----------+-----------+----------------+----+
like image 121
David Schuler Avatar answered Oct 19 '22 12:10

David Schuler