I'm trying to round timestamps to the nearest hour using PySpark and a UDF.
The function works properly in plain Python, but not when used through PySpark.
The input is:

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp

The function and the UDF that wraps it:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def time_feature_creation_spark(date):
    return date.round("H").hour

time_feature_creation_udf = udf(lambda x: time_feature_creation_spark(x), IntegerType())

Then I use it in the function that feeds Spark:

data = data.withColumn("hour", time_feature_creation_udf(data["date"]))
And the error is:
TypeError: 'Column' object is not callable
The expected output is simply the hour closest to the time in the datetime (e.g. 20:45 is closest to 21:00, so it returns 21).
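For reference, a UDF version that does work, as a minimal sketch: Spark hands a TimestampType value to a Python UDF as a datetime.datetime, which has no pandas-style .round(), so the rounding here is done with plain datetime arithmetic (the names nearest_hour and nearest_hour_udf are illustrative):

from datetime import timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def nearest_hour(ts):
    # Spark passes TimestampType values as datetime.datetime;
    # adding 30 minutes and keeping the hour rounds to the
    # nearest hour (20:45 -> 21:15 -> 21).
    if ts is None:
        return None
    return (ts + timedelta(minutes=30)).hour

nearest_hour_udf = udf(nearest_hour, IntegerType())
# data = data.withColumn("hour", nearest_hour_udf(data["date"]))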
A nicer version than the unix-timestamp arithmetic (/3600*3600) is the built-in function date_trunc:
import pyspark.sql.functions as F

df = df.withColumn("hourly_timestamp", F.date_trunc("hour", df.timestamp))
Other formats besides 'hour' are:

'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'
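Note that date_trunc truncates (floors), so 20:45 becomes 20:00 rather than the nearest hour the question asks for. One way to get nearest-hour behavior without a UDF is to shift the timestamp by 30 minutes before truncating; a sketch, assuming the input column is named "date" (the output column names rounded_ts and hour are illustrative):

import pyspark.sql.functions as F

# 20:45 + 30 min = 21:15, which truncates to 21:00
data = data.withColumn(
    "rounded_ts",
    F.date_trunc("hour", F.col("date") + F.expr("INTERVAL 30 MINUTES")),
).withColumn("hour", F.hour("rounded_ts"))  # -> 21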