
Spark - Datediff for months?

Is there a good way to use datediff with months? To clarify: the datediff method takes two columns and returns the number of days that have passed between the two dates. I'd like to have that in months. I want to have a parameter in my function that I can tell to check data from, say, the last 20, 36, or however many months. If I just use datediff and divide the result by 30 (or 31), then the result is not quite accurate. I could use 30.4166667 (= 365 days / 12 months), but that is not quite accurate either for shorter periods. So, any tips on how to use datediff to get months out of it? SQL has it like SELECT DATEDIFF(month, '2005-12-31 23:59:59.9999999', '2006-01-01 00:00:00.0000000'); I'm looking for something like this in Spark.
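
A minimal PySpark sketch contrasting the naive day-count division described above with the built-in months_between() covered in the answers below; the column names and dates are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2005-12-31", "2006-03-15")], ["start_date", "end_date"])
df = df.select(F.to_date("start_date").alias("start_date"), F.to_date("end_date").alias("end_date"))

df.select(
    # naive: day count divided by an average month length
    (F.datediff("end_date", "start_date") / 30.4166667).alias("approx_months"),
    # built-in month-aware difference (see the answers below)
    F.months_between("end_date", "start_date").alias("exact_months"),
).show()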

asked Aug 10 '16 07:08 by lte__

People also ask

How do you calculate months between two dates in PySpark?

Using the PySpark SQL functions datediff() and months_between(), you can calculate the difference between two dates in days, months, and years. You can also use them to calculate age.
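
A hedged sketch of that pattern; the DataFrame and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2000-05-01", "2016-08-10")], ["birth_date", "as_of"])
df = df.select(F.to_date("birth_date").alias("birth_date"), F.to_date("as_of").alias("as_of"))

df.select(
    F.datediff("as_of", "birth_date").alias("days"),
    F.months_between("as_of", "birth_date").alias("months"),
    # age in whole years, derived from the month difference
    F.floor(F.months_between("as_of", "birth_date") / 12).alias("age_years"),
).show()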

How do you subtract months in PySpark?

Spark SQL provides the DataFrame function add_months() to add or subtract months from a date column, and date_add() / date_sub() to add and subtract days.
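
A small illustrative sketch, assuming a date column named event_date:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2016-08-10",)], ["event_date"])
df = df.select(F.to_date("event_date").alias("event_date"))

df.select(
    F.add_months("event_date", -20).alias("twenty_months_earlier"),  # negative value subtracts months
    F.date_add("event_date", 7).alias("one_week_later"),
    F.date_sub("event_date", 7).alias("one_week_earlier"),
).show()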

What is PySpark datediff?

pyspark.sql.functions.datediff(end, start) returns the number of days from start to end.
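
For example (illustrative column names; a positive result means end is after start):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2005-12-31", "2006-01-01")], ["start_date", "end_date"])

df.select(
    F.datediff(F.to_date("end_date"), F.to_date("start_date")).alias("days")  # -> 1
).show()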

How is PySpark time difference calculated?

A timestamp difference in PySpark can be calculated by 1) using unix_timestamp() to get each time in seconds and subtracting one from the other, or 2) casting the TimestampType column to LongType and subtracting the two long values to get the difference in seconds, then dividing it by 60 for the minute difference or by 3600 for the hour difference.
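
A sketch of both approaches; the timestamps and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2016-08-10 07:00:00", "2016-08-10 08:30:00")], ["start_ts", "end_ts"]
)

df.select(
    # 1) unix_timestamp(): seconds since epoch, subtracted directly
    (F.unix_timestamp("end_ts") - F.unix_timestamp("start_ts")).alias("diff_seconds"),
    # 2) cast to timestamp, then to long, subtract, and divide by 60 for minutes
    ((F.col("end_ts").cast("timestamp").cast("long")
      - F.col("start_ts").cast("timestamp").cast("long")) / 60).alias("diff_minutes"),
).show()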


2 Answers

You can try months_between:

import static org.apache.spark.sql.functions.*;

DataFrame newDF = df.withColumn("monthDiff", months_between(col("col1"), col("col2")));
answered Oct 13 '22 14:10 by Daniel de Paula


This worked for me:

from pyspark.sql.functions import months_between  # not required for the SQL string below, which resolves months_between itself

data = sqlContext.sql('''
SELECT DISTINCT mystartdate,myenddate,
 CAST(months_between(mystartdate,myenddate) as int) as months_tenure
FROM mydatabase
''')
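
For reference, a sketch of the same calculation through the DataFrame API instead of a SQL string, assuming the same mydatabase table and column names as in the query above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
mydf = spark.table("mydatabase")  # same source table as the SQL query above

months_df = mydf.select(
    "mystartdate",
    "myenddate",
    # months_between() returns a fractional value; casting to int truncates it
    F.months_between("mystartdate", "myenddate").cast("int").alias("months_tenure"),
).distinct()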
answered Oct 13 '22 15:10 by Irene