I have data like this:
df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')
], ('low', 'high', 'normal'))
I want to calculate the date difference between the low column and 2017-05-02, and replace the low column with that difference. I've tried related solutions on Stack Overflow, but none of them works.
Getting the difference between two dates in days, months, and quarters in PySpark can be accomplished with the datediff() and months_between() functions. datediff() calculates the difference between two dates in days.
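For instance, a minimal sketch of both functions; the single-row DataFrame and the startDate/endDate column names here are made up purely for illustration:

from pyspark.sql.functions import datediff, months_between, to_date

# Toy DataFrame with two date-string columns (hypothetical data)
dates = sqlContext.createDataFrame([("2017-01-01", "2017-05-02")],
                                   ("startDate", "endDate"))

dates.select(
    datediff(to_date("endDate"), to_date("startDate")).alias("diff_days"),
    months_between(to_date("endDate"), to_date("startDate")).alias("diff_months")
).show()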
In Spark SQL, the equivalent query is:

SELECT startDate,
       endDate,
       DATEDIFF(endDate, startDate) AS diff_days,
       CAST(months_between(endDate, startDate) AS INT) AS diff_months
FROM yourTable
ORDER BY 1;

There are also year and quarter functions for determining the year and quarter of a date, respectively.
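The DataFrame-API counterpart of those year and quarter lookups might look like this (a sketch, reusing the toy dates DataFrame from above):

from pyspark.sql.functions import quarter, to_date, year

dates.select(
    year(to_date("startDate")).alias("yr"),      # calendar year of the date
    quarter(to_date("startDate")).alias("qtr")   # quarter (1-4) of the date
).show()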
To subtract or add days, months, and years to a timestamp in PySpark, use the date_add() and add_months() functions. add_months() takes a number of months as its argument to shift a timestamp by months; date_add() takes a number of days as its argument to shift it by days.
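A short sketch of both, again on the toy dates DataFrame from above (negative arguments subtract instead of add):

from pyspark.sql.functions import add_months, date_add, to_date

dates.select(
    date_add(to_date("startDate"), 7).alias("plus_7_days"),      # shift by days
    add_months(to_date("startDate"), 3).alias("plus_3_months")   # shift by months
).show()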
In plain Python, use the strptime(date_str, format) function to convert a date string into a datetime object according to the corresponding format; to get the difference between two dates, subtract date2 from date1.
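Applied to this question's values, that pure-Python approach (no Spark involved) would be:

from datetime import datetime

date1 = datetime.strptime("2017-05-02", "%Y-%m-%d")   # the target date
date2 = datetime.strptime("1986/10/15", "%Y/%m/%d")   # a value from the low column
print((date1 - date2).days)                           # 11157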
You need to cast the column low to class date, and then you can use datediff() in combination with lit(). Using Spark 2.2:
from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date("low", "yyyy/MM/dd"))).show()

+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+
Using Spark < 2.2, we need to convert the low column to class timestamp first:
from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp('low', "yyyy/MM/dd")
                               .cast("timestamp")))).show()