I have following data, where the data is partitioned by the stores and month id and ordered by amount in order to get the primary vendor for the store.
I need a tie breaker if the amount is equal between two vendors, then if one of the tied vendor was the previous months most sales vendor, make that vendor as the most sales vendor for the month.
The look back will increase if there is a tie again. Lag of 1 month will not work if there is tie again. Worst case scenario we will have more duplicates in previous month also.
sample data
val data = Seq((201801, 10941, 115, 80890.44900, 135799.66400),
(201801, 10941, 3, 80890.44900, 135799.66400) ,
(201712, 10941, 3, 517440.74500, 975893.79000),
(201712, 10941, 115, 517440.74500, 975893.79000),
(201711, 10941, 3 , 371501.92100, 574223.52300),
(201710, 10941, 115, 552435.57800, 746912.06700),
(201709, 10941, 115,1523492.60700,1871480.06800),
(201708, 10941, 115,1027698.93600,1236544.50900),
(201707, 10941, 33 ,1469219.86900,1622949.53000)
).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")
Code:
val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)
Output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 1|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 115| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
My rank is 1 for both 1 and 2 records, but my rank should be 1 for second record based on previous month max dollars
I am expecting the following output.
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 2|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 3| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Should I write a UDAF? Any suggestions would help.
You can do this with 2 windows. First, you will need to use the lag() function to carry over the previous month's sales values so that you can use that in your rank window. here's that part in pyspark:
lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))
Then edit your window to include that new column:
window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711| 10941| 99| 371501.921| 574223.523| null| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1027698.936| 1|
|201707| 10941| 33|1469219.869| 1622949.53| null| 1|
|201708| 10941| 115|1027698.936|1236544.509| null| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1523492.607| 1|
|201712| 10941| 3| 517440.745| 975893.79| null| 1|
|201801| 10941| 3| 80890.449| 135799.664| 517440.745| 1|
|201801| 10941| 115| 80890.449| 135799.664| 552435.578| 2|
+------+--------+-----+-----------+-----------+----------------+----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With