How to get the latest date from listed dates along with the total count?

Question

I have the DataFrame below. Each id-key pair appears with several different dates, and for each pair I would like to display the latest date together with the row count.

Input data as below:

id  key  date 
11  222  1/22/2017
11  222  1/22/2015
11  222  1/22/2016 
11  223  9/22/2017 
11  223  1/22/2010 
11  223  1/22/2008

Code I have tried:

val counts = df.groupBy($"id",$"key").count()

This gives the following output:

id  key  count
11  222  3
11  223  3

However, I would like the output to be as below:

id  key  count  maxDate
11  222  3      1/22/2017
11  223  3      9/22/2017

Shaido · Accepted Answer

One way is to convert the date to Unix time, do the aggregation, and then convert the result back. These conversions can be performed with unix_timestamp and from_unixtime respectively. Once the date is a Unix timestamp (a number of seconds), the latest date is simply the maximum value. The only possible downside of this approach is that the date format must be given explicitly.

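import org.apache.spark.sql.functions.{count, from_unixtime, max, unix_timestamp}

// Pattern used both to parse the input and to format the result
val dateFormat = "MM/dd/yyyy"

val df2 = df.withColumn("date", unix_timestamp($"date", dateFormat)) // string -> seconds since epoch
  .groupBy($"id", $"key").agg(count("date").as("count"), max("date").as("maxDate"))
  .withColumn("maxDate", from_unixtime($"maxDate", dateFormat)) // seconds -> formatted string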

This gives the following result (the month is zero-padded because the output is formatted with MM):

+---+---+-----+----------+
| id|key|count|   maxDate|
+---+---+-----+----------+
| 11|222|    3|01/22/2017|
| 11|223|    3|09/22/2017|
+---+---+-----+----------+
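For comparison, here is a minimal sketch of a variation on the same idea that parses the strings into a date column with to_date instead of going through Unix time. It is not part of the original answer. The object name LatestDatePerKey, the helper latestPerKey, the intermediate column parsed, and the pattern "M/d/yyyy" are all assumptions made for illustration. The single-letter pattern is chosen because the sample input has unpadded months; as far as I know, newer Spark versions can be stricter about two-letter patterns such as MM when the input has a single digit, so treat the pattern as something to check against your Spark version. The sketch assumes Spark 2.2 or later, where to_date accepts a format argument.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, date_format, max, to_date}

object LatestDatePerKey {
  // Assumed pattern: single letters accept both "1/22/2017" and "01/22/2017".
  val datePattern = "M/d/yyyy"

  // Hypothetical helper: row count and latest date for each (id, key) pair.
  def latestPerKey(df: DataFrame): DataFrame =
    df.withColumn("parsed", to_date(col("date"), datePattern)) // string -> DateType
      .groupBy(col("id"), col("key"))
      .agg(
        count(col("parsed")).as("count"), // counts rows whose date parsed to a non-null value
        max(col("parsed")).as("maxDate")  // DateType values compare chronologically
      )
      .withColumn("maxDate", date_format(col("maxDate"), datePattern)) // back to a string

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("latest-date-per-key")
      .getOrCreate()
    import spark.implicits._

    // Same rows as in the question
    val df = Seq(
      (11, 222, "1/22/2017"),
      (11, 222, "1/22/2015"),
      (11, 222, "1/22/2016"),
      (11, 223, "9/22/2017"),
      (11, 223, "1/22/2010"),
      (11, 223, "1/22/2008")
    ).toDF("id", "key", "date")

    latestPerKey(df).orderBy("id", "key").show()
    spark.stop()
  }
}
```

Formatting the result with the same unpadded pattern should keep maxDate in the style the question asked for. If a real date column is more useful downstream, the final withColumn can simply be dropped.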

Tags: scala, apache-spark, apache-spark-sql
