
How to subtract a column of days from a column of dates in PySpark?

Given the following PySpark DataFrame:

df = sqlContext.createDataFrame([('2015-01-15', 10),
                                 ('2015-02-15', 5)],
                                 ('date_col', 'days_col'))

How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].

I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']).

I also tried creating a UDF:

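from datetime import timedelta
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def subtract_date(start_date, days_to_subtract):
    # Assumes start_date arrives as a datetime.date (i.e. date_col is a DateType column)
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))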

This technically works, but I've read that moving data between Spark and Python can cause performance issues on large datasets. I can stick with this solution for now (no need to optimize prematurely), but my gut says there must be a way to do something this simple without a Python UDF.

asked Mar 17 '16 03:03 by kjmij


1 Answer

Use the expr function (if the number of days to subtract comes from a column):

>>> from pyspark.sql.functions import expr, date_sub
>>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))

Use date_sub directly inside withColumn (if the number of days to subtract is a literal):

>>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))
answered Sep 21 '22 00:09 by notNull