Calculate time between two dates in pyspark

Hoping this is fairly elementary. I have a Spark dataframe containing a Date column, and I want to add a new column with the number of days since that date. Google-fu is failing me.

Here's what I've tried:

from pyspark.sql.types import *
import datetime
today = datetime.date.today()

schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2016,12,1),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',today - df.foo)
df.show()

It fails with this error:

u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 - foo)' requires (numeric or calendarinterval) type, not date;"

I've tried fiddling around but have gotten nowhere. I can't imagine this is too hard. Can anyone help?

asked Feb 15 '17 by jamiet

1 Answer

OK, figured it out: wrap the Python date in lit() and use datediff rather than plain subtraction.

from pyspark.sql.types import *
import pyspark.sql.functions as funcs
import datetime

today = datetime.date(2017, 2, 15)

schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017, 2, 14),)]
df = sqlContext.createDataFrame(l, schema)
# datediff(end, start) returns the number of days from start to end
df = df.withColumn('daysBetween', funcs.datediff(funcs.lit(today), df.foo))
df.collect()

This returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
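For reference, if you'd rather not hard-code today's date on the Python side, something along these lines should also work, using current_date() from pyspark.sql.functions. This is just a variant sketch, not part of the original answer, and the column name daysSince is arbitrary:

import pyspark.sql.functions as funcs

# current_date() is evaluated by Spark itself, so no Python-side date literal is needed
df = df.withColumn('daysSince', funcs.datediff(funcs.current_date(), df.foo))
df.show()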

answered Oct 13 '22 by jamiet