Find minimum for a timestamp through Spark groupBy dataframe

When I try to group my dataframe on a column and then find the minimum for each group with groupedDataframe.min('timestampCol'), it appears I cannot do it on non-numeric columns. How can I properly get the minimum (earliest) date per group?

I am streaming the dataframe from a PostgreSQL S3 instance, so that data is already configured.
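To make the problem concrete, here is a minimal sketch of the failing call (the names df, id and ts are placeholders):

// GroupedData.min only accepts numeric columns, so calling it on a
// timestamp column throws an AnalysisException along the lines of
// '"ts" is not a numeric column. Aggregation function can only be
// applied on a numeric column.'
df.groupBy("id").min("ts")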

asked Apr 05 '16 by Jake Fund

1 Answer

Just perform the aggregation directly instead of using the min helper on the grouped data:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???

import sqlContext.implicits._

// Build a toy DataFrame and cast the string column to a real timestamp.
val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

// min as an aggregate expression, not the GroupedData.min shortcut.
df.groupBy($"id").agg(min($"ts")).show

// +---+--------------------+
// | id|             min(ts)|
// +---+--------------------+
// |  1|2014-01-01 15:10:...|
// +---+--------------------+

Unlike GroupedData.min, which is restricted to numeric columns, min used as an aggregate expression works on any orderable type, including timestamps.
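If the goal is then to keep only the rows that carry each group's earliest timestamp, the aggregated result can be joined back (a minimal sketch; assumes Spark 1.6+, where join accepts a Seq of using-columns):

// Compute the per-group minimum, aliased back to "ts" so it can serve
// as a join key, then keep the original rows that match it.
val earliest = df.groupBy($"id").agg(min($"ts").alias("ts"))
df.join(earliest, Seq("id", "ts")).show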

answered Oct 13 '22 by zero323