Difference of two TimestampTypes in pyspark

Using PySpark, I have a DataFrame with two TimestampType columns:

df.schema
...StructField(session_start,TimestampType,true),StructField(session_end,TimestampType,true)...

But I can't figure out how to compute the difference:

df2 = df.withColumn("session_length",col("session_end")-col("session_start"))

gives me

AnalysisException: u"cannot resolve '(`session_end` - `session_start`)' due to data type mismatch: '(`session_end` - `session_start`)' requires (numeric or calendarinterval) type, not timestamp ...

I haven't found an alternative that works. (There's a datediff function, but that returns a result in days, and I need the difference in seconds.)
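
As a quick sketch of why datediff falls short (using the column names from the schema above):

from pyspark.sql.functions import datediff

# datediff counts whole days between the two timestamps, so a session
# that starts and ends on the same day comes back as 0
df.withColumn("session_days", datediff("session_end", "session_start"))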

How should I write this?

EDIT: The original version of this question had a different error caused by reusing col as a variable in my notebook. After redoing the import to get the function back, I now get the above AnalysisException.
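
A minimal sketch of that pitfall and its fix, assuming a notebook session:

# pitfall: binding the name col to a value shadows the imported function
col = "some value"

# fix: re-run the import to restore the function
from pyspark.sql.functions import col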

asked Mar 23 '18 by ajb

1 Answer

You should use pyspark.sql.functions.unix_timestamp to convert your columns to seconds since the epoch, then take the difference:

from pyspark.sql.functions import unix_timestamp

# unix_timestamp converts a TimestampType column to seconds since the
# epoch, so subtracting the two gives the session length in seconds
df2 = df.withColumn(
    "session_length",
    unix_timestamp("session_end") - unix_timestamp("session_start")
)
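
One caveat worth noting (standard Spark behavior, not spelled out in the original answer): unix_timestamp truncates to whole seconds. If fractional seconds matter, a common alternative is to cast the TimestampType columns directly:

from pyspark.sql.functions import col

# casting TimestampType to double gives seconds since the epoch with
# the fractional part preserved; cast to "long" for whole seconds
df2 = df.withColumn(
    "session_length",
    col("session_end").cast("double") - col("session_start").cast("double")
)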

answered Sep 23 '22 by pault