I have a DataFrame that includes a timestamp column. To aggregate it by time (minute, hour, or day), I tried the following:
import org.apache.spark.sql.functions.udf

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // truncate to the start of the hour (period = 1 hour)
})
val df: DataFrame // the dataframe
df.groupBy(toSegment($"timestamp")).count()
This works fine.
My question is how to generalize the UDF toSegment so that it accepts an arbitrary period:
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})
I tried the following, but it doesn't work:
df.groupBy(toSegmentGeneralized($"timestamp", $"3600000")).count()
Spark seems to look for a column named 3600000 instead of treating the value as a constant.
A possible solution would be to pass the constant as a column, but I couldn't find out how to create one.
You can use org.apache.spark.sql.functions.lit() to create the constant column:

import org.apache.spark.sql.functions._
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()
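For context, here is a minimal self-contained sketch of the whole approach (assuming Spark 2.x+ with a local SparkSession; the sample data and values are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().master("local[*]").appName("segment-example").getOrCreate()
import spark.implicits._

// Hypothetical sample data: epoch-millisecond timestamps stored as strings
val df = Seq("1428476400000", "1428478200000", "1428562800000").toDF("timestamp")

// Truncate an epoch-millisecond timestamp to the start of its period
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

// lit() wraps the constant so it is passed as a Column, not interpreted as a column name
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count().show()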