 

How to calculate date difference in pyspark?

I have data like this:

```python
df = sqlContext.createDataFrame(
    [('1986/10/15', 'z', 'null'),
     ('1986/10/15', 'z', 'null'),
     ('1986/10/15', 'c', 'null'),
     ('1986/10/15', 'null', 'null'),
     ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))
```

I want to calculate the date difference between the low column and 2017-05-02, and replace the low column with the difference. I've tried related solutions on Stack Overflow, but none of them works.

Asked by dlwlrma on May 17 '17.

People also ask

How do you find the date difference in Pyspark?

The difference between two dates in days, years, months, and quarters in PySpark can be obtained using the datediff() and months_between() functions. The datediff() function calculates the difference between two dates in days.

How do you find the date difference on a spark?

```sql
SELECT startDate, endDate,
       DATEDIFF(endDate, startDate) AS diff_days,
       CAST(months_between(endDate, startDate) AS INT) AS diff_months
FROM yourTable
ORDER BY 1;
```

There are also year and quarter functions for determining the year and quarter of a date, respectively.

How do you subtract days in Pyspark?

To subtract or add days, months, and years to a timestamp in PySpark, use the date_add() and add_months() functions. add_months() takes a number of months as its argument to add months to a timestamp; date_add() takes a number of days as its argument to add days to a timestamp (a negative argument, or date_sub(), subtracts days).

How do I find the difference between two dates in Python?

Use the strptime(date_str, format) function to convert a date string into a datetime object according to the corresponding format. To get the difference between two dates, subtract date2 from date1.
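In plain Python this looks as follows, using the dates from the question (note the two different format strings):

```python
from datetime import datetime

# Parse each string with the format that matches it
date1 = datetime.strptime("2017-05-02", "%Y-%m-%d")
date2 = datetime.strptime("1986/10/15", "%Y/%m/%d")

# Subtracting two datetimes yields a timedelta
diff = date1 - date2
print(diff.days)  # 11157
```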




1 Answer

You need to cast the low column to date, and then you can use datediff() in combination with lit(). With Spark 2.2+:

```python
from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date("low", "yyyy/MM/dd"))).show()
```

```
+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+
```
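As a sanity check, the day counts in the output above can be reproduced with Python's standard datetime module:

```python
from datetime import date

anchor = date(2017, 5, 2)
days_from_oct_15 = (anchor - date(1986, 10, 15)).days  # 11157
days_from_oct_16 = (anchor - date(1986, 10, 16)).days  # 11156
```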

With Spark < 2.2, we first need to convert the low column to class timestamp:

```python
from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp("low", "yyyy/MM/dd").cast("timestamp")))).show()
```
Answered by mtoto on Sep 19 '22.