How to get the latest date from listed dates along with the total count?

I have the DataFrame below. Each id-key pair appears with several different dates, and for each pair I would like to display the latest date together with the row count.

Input data as below:

id  key  date 
11  222  1/22/2017
11  222  1/22/2015
11  222  1/22/2016 
11  223  9/22/2017 
11  223  1/22/2010 
11  223  1/22/2008

Code I have tried:

val counts = df.groupBy($"id",$"key").count()

This gives the following output:

id  key  count 
11  222   3
11  223   3

However, I would like the output to be as below:

id  key  count maxDate 
11  222   3    1/22/2017 
11  223   3    9/22/2017
asked Nov 28 '25 07:11 by lak


1 Answer

One way is to convert the date to Unix time, do the aggregation, and then convert the result back. The conversions to and from Unix time can be performed with unix_timestamp and from_unixtime, respectively. Once the date is a Unix timestamp, the latest date is simply the maximum value. The only downside of this approach is that the date format must be given explicitly.

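import org.apache.spark.sql.functions.{count, from_unixtime, max, unix_timestamp}

val dateFormat = "MM/dd/yyyy"

// Parse the date strings to Unix seconds, aggregate per (id, key), then format the maximum back.
// Assumes `import spark.implicits._` is in scope for the $"..." column syntax.
val df2 = df.withColumn("date", unix_timestamp($"date", dateFormat))
  .groupBy($"id", $"key").agg(count("date").as("count"), max("date").as("maxDate"))
  .withColumn("maxDate", from_unixtime($"maxDate", dateFormat))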

This will give you:

+---+---+-----+----------+
| id|key|count|   maxDate|
+---+---+-----+----------+
| 11|222|    3|01/22/2017|
| 11|223|    3|09/22/2017|
+---+---+-----+----------+
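
To see why the dates are parsed before taking the maximum, here is a minimal sketch of the same count-and-latest-date logic on plain Scala collections, without Spark. The object name `LatestDatePerKey` and the method `summarize` are made up for this illustration, and the `M/d/yyyy` pattern is an assumption chosen to match the unpadded dates in the question. Taking the maximum of the raw strings would compare them character by character (for example, "9/22/2010" would sort after "1/22/2017"), so the sketch compares parsed dates instead.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper for illustration only; not part of Spark.
object LatestDatePerKey {
  // Assumed pattern: "M/d/yyyy" accepts one- or two-digit months and days.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  // For each (id, key) pair, return (id, key, row count, latest date as a string).
  def summarize(rows: Seq[(Int, Int, String)]): Seq[(Int, Int, Int, String)] =
    rows
      .groupBy { case (id, key, _) => (id, key) }
      .map { case ((id, key), group) =>
        // Parse every date in the group and keep the chronologically latest one.
        val latest = group
          .map { case (_, _, date) => LocalDate.parse(date, fmt) }
          .maxBy(_.toEpochDay)
        (id, key, group.size, latest.format(fmt))
      }
      .toSeq
      .sortBy { case (id, key, _, _) => (id, key) }

  def main(args: Array[String]): Unit = {
    // Sample rows from the question: (id, key, date)
    val rows = Seq(
      (11, 222, "1/22/2017"), (11, 222, "1/22/2015"), (11, 222, "1/22/2016"),
      (11, 223, "9/22/2017"), (11, 223, "1/22/2010"), (11, 223, "1/22/2008")
    )
    summarize(rows).foreach(println)
  }
}
```

On the sample rows, each pair comes back with a count of 3 and the latest dates 1/22/2017 and 9/22/2017, matching the desired output in the question.
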
answered Nov 29 '25 22:11 by Shaido


