Using pyspark, I have a DataFrame with two TimestampType columns:
df.schema
...StructField(session_start,TimestampType,true),StructField(session_end,TimestampType,true)...
But I can't figure out how to compute the difference:
df2 = df.withColumn("session_length", col("session_end") - col("session_start"))
gives me
AnalysisException: u"cannot resolve '(`session_end` - `session_start`)' due to data type mismatch: '(`session_end` - `session_start`)' requires (numeric or calendarinterval) type, not timestamp ...
I haven't found an alternative that works. (There's a datediff
function, but that returns a result in days, and I need the difference in seconds.)
How should I write this?
EDIT: The original version of this question had a different error caused by reusing col as a variable in my notebook. After redoing the import to get the function back, I now get the AnalysisException above.
You should use pyspark.sql.functions.unix_timestamp to convert each timestamp column to seconds since the Unix epoch, then take the difference:
from pyspark.sql.functions import unix_timestamp

# unix_timestamp() converts a timestamp column to seconds since the
# epoch, so subtracting gives the session length in whole seconds.
df2 = df.withColumn(
    "session_length",
    unix_timestamp("session_end") - unix_timestamp("session_start")
)
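An equivalent route, assuming Spark 2.x, is to cast each timestamp to long, which also yields seconds since the Unix epoch, so the subtraction from the question works once both sides are numeric. The sketch below is self-contained; the sample rows are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical session data, cast from strings to timestamps.
df = spark.createDataFrame(
    [("2017-01-01 10:00:00", "2017-01-01 10:05:30")],
    ["session_start", "session_end"],
).select(
    col("session_start").cast("timestamp"),
    col("session_end").cast("timestamp"),
)

# Casting a timestamp to long gives Unix seconds, making the
# column difference well-typed.
df2 = df.withColumn(
    "session_length",
    col("session_end").cast("long") - col("session_start").cast("long"),
)
df2.show()  # session_length should be 330 (5 minutes 30 seconds)

Note that both unix_timestamp and the long cast truncate to whole seconds; if you need sub-second precision, casting to "double" instead should preserve the fractional part.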