When I group my dataframe on a column and then try to find the minimum for each group with groupbyDataframe.min('timestampCol'), it appears this cannot be done on non-numerical columns. How can I properly get the minimum (earliest) date for each group?
I am streaming the dataframe from a PostgreSQL instance on S3, so that data is already configured.
Just perform the aggregation directly with agg instead of the min helper:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???
import sqlContext.implicits._

val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp")) // cast the string column to an orderable timestamp

df.groupBy($"id").agg(min($"ts")).show
// +---+--------------------+
// | id| min(ts)|
// +---+--------------------+
// | 1|2014-01-01 15:10:...|
// +---+--------------------+
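If you actually need to keep the whole earliest row per group rather than just the aggregated value, one common approach is a window function. A minimal sketch against the df above (the rn column name is arbitrary, and this assumes a static dataframe):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each id by ascending ts, then keep the first one
val byIdEarliestFirst = Window.partitionBy($"id").orderBy($"ts".asc)

df.withColumn("rn", row_number().over(byIdEarliestFirst))
  .where($"rn" === 1)
  .drop("rn")
  .show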
Unlike the min method on GroupedData, the min function used inside agg works on any orderable type, not just numeric columns.
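For instance, the same pattern works on a plain string column, which the GroupedData min shortcut would reject. A small sketch (strDf and label are hypothetical names):

val strDf = Seq((1L, "b"), (1L, "a")).toDF("id", "label")

// Strings are ordered lexicographically, so the minimum is "a"
strDf.groupBy($"id").agg(min($"label")).show
// +---+----------+
// | id|min(label)|
// +---+----------+
// |  1|         a|
// +---+----------+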