How to subtract a column of days from a column of dates in PySpark?

Tags:

python

apache-spark

apache-spark-sql

pyspark

user-defined-functions

Given the following PySpark DataFrame

df = sqlContext.createDataFrame([('2015-01-15', 10),
                                 ('2015-02-15', 5)],
                                 ('date_col', 'days_col'))

How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].

I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']).

I also tried creating a UDF:

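from datetime import timedelta
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def subtract_date(start_date, days_to_subtract):
    # Assumes start_date arrives as a datetime.date, i.e. date_col is a DateType column
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))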

This technically works, but I've read that stepping between Spark and Python can cause performance issues for large datasets. I can stick with this solution for now (no need to prematurely optimize), but my gut says there's just got to be a way to do this simple thing without using a Python UDF.

604

asked Mar 17 '16 03:03

kjmij

1 Answer

Use the expr function if the number of days to subtract comes from a column:

>>> from pyspark.sql.functions import date_sub, expr
>>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))

Call date_sub directly if the number of days to subtract is a literal value:

>>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))

130

answered Sep 21 '22 00:09

notNull
