I have the DataFrame below. It has keys with different dates, and I would like to display the latest date together with the count for each id-key pair.
Input data as below:
id key date
11 222 1/22/2017
11 222 1/22/2015
11 222 1/22/2016
11 223 9/22/2017
11 223 1/22/2010
11 223 1/22/2008
Code I have tried:
val counts = df.groupBy($"id",$"key").count()
I am getting the output below:
id key count
11 222 3
11 223 3
However, I would like the output to be as below:
id key count maxDate
11 222 3 1/22/2017
11 223 3 9/22/2017
One way is to convert the date to Unix time, do the aggregation, and then convert it back again. These conversions can be performed with unix_timestamp and from_unixtime, respectively. Once the date is in Unix time, the latest date can be selected by finding the maximum value. The only real downside of this approach is that the date format must be given explicitly.
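import org.apache.spark.sql.functions.{count, from_unixtime, max, unix_timestamp}
val dateFormat = "MM/dd/yyyy"
val df2 = df
  .withColumn("date", unix_timestamp($"date", dateFormat)) // string -> seconds since the epoch
  .groupBy($"id", $"key")
  .agg(count("date").as("count"), max("date").as("maxDate"))
  .withColumn("maxDate", from_unixtime($"maxDate", dateFormat)) // seconds -> formatted string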
Which will give you:
+---+---+-----+----------+
| id|key|count| maxDate|
+---+---+-----+----------+
| 11|222| 3|01/22/2017|
| 11|223| 3|09/22/2017|
+---+---+-----+----------+
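As a variation on the same idea, here is a minimal sketch that parses the string into a DateType column with to_date and formats the maximum back with date_format, instead of going through Unix time. Either conversion matters because taking max directly on the string column would compare text, so a value like "9/22/2008" would sort above "1/22/2017". The local SparkSession, the sample rows copied from the question, and the names inputFormat and latest are assumptions for illustration. The "M/d/yyyy" pattern is also an assumption: on Spark 3.x the default parser is stricter and may reject unpadded months such as "1/22/2017" under "MM", so the single-letter form is used here. The snippet is meant to be pasted into spark-shell (with :paste) or run as a Scala script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, date_format, max, to_date}

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("latest-date-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
val df = Seq(
  (11, 222, "1/22/2017"),
  (11, 222, "1/22/2015"),
  (11, 222, "1/22/2016"),
  (11, 223, "9/22/2017"),
  (11, 223, "1/22/2010"),
  (11, 223, "1/22/2008")
).toDF("id", "key", "date")

// Assumed input pattern: single-letter M and d accept unpadded values.
val inputFormat = "M/d/yyyy"

val latest = df
  .withColumn("date", to_date($"date", inputFormat))            // string -> DateType
  .groupBy($"id", $"key")
  .agg(count("date").as("count"), max("date").as("maxDate"))    // max now compares dates, not text
  .withColumn("maxDate", date_format($"maxDate", "MM/dd/yyyy")) // DateType -> padded string
  .orderBy($"id", $"key")

latest.show()
```

With the sample rows, this should print the same two rows as the table above. Keeping the column as DateType until the final step also means the max is readable if you inspect the intermediate result, which is the main reason to prefer it over raw epoch seconds.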