Find previous month sale of each city from Spark Data frame
|City| Month |Sale|
+----+----------- +----- +
| c1| JAN-2017| 49 |
| c1| FEB-2017| 46 |
| c1| MAR-2017| 83 |
| c2| JAN-2017| 59 |
| c2| MAY-2017| 60 |
| c2| JUN-2017| 49 |
| c2| JUL-2017| 73 |
+----+-----+----+-------
Required solution is
|City| Month |Sale |previous_sale|
+----+-----+-------+-------------+--------
| c1| JAN-2017| 49| NULL |
| c1| FEB-2017| 46| 49 |
| c1| MAR-2017| 83| 46 |
| c2| JAN-2017| 59| NULL |
| c2| MAY-2017| 60| 59 |
| c2| JUN-2017| 49| 60 |
| c2| JUL-2017| 73| 49 |
+----+-----+----+-------------+-----------
Please help me
You can use lag function to get the previous value
If you want to sort by month you need to convert to proper date. For "JAN-2017"
to "01-01-2017"
something like this.
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
("c1", "JAN-2017", 49),
("c1", "FEB-2017", 46),
("c1", "MAR-2017", 83),
("c2", "JAN-2017", 59),
("c2", "MAY-2017", 60),
("c2", "JUN-2017", 49),
("c2", "JUL-2017", 73)
)).toDF("city", "month", "sales")
val window = Window.partitionBy("city").orderBy("month")
df.withColumn("previous_sale", lag($"sales", 1, null).over(window)).show
Output:
+----+--------+-----+----+
|city| month|sales| previous_sale|
+----+--------+-----+----+
| c1|FEB-2017| 46|null|
| c1|JAN-2017| 49| 46|
| c1|MAR-2017| 83| 49|
| c2|JAN-2017| 59|null|
| c2|JUL-2017| 73| 59|
| c2|JUN-2017| 49| 73|
| c2|MAY-2017| 60| 49|
+----+--------+-----+----+
You can use this UDF to create a default date like 01/month/year which will be used so sort with date even if it has different year
val fullDate = udf((value :String )=>
{
val months = List("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
val splited = value.split("-")
new Date(splited(1).toInt, months.indexOf(splited(0)) + 1, 1)
})
df.withColumn("month", fullDate($"month")).show()
Hope this hepls!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With