Add months to date column in Spark dataframe

I want to add months to a date column in a Spark DataFrame that has two columns, of types Date and Int.

e.g.

df.show()
data_date months_to_add
2015-06-23 5
2016-07-20 7

I want to add a new column holding the date obtained by adding the months to the existing date. The output should look like this:

data_date months_to_add new_data_date
2015-06-23 5           2015-11-23
2016-07-20 7           2017-02-20

I tried the code below, but it does not seem to work:

df = df.withColumn("new_data_date",
  add_months(col("data_date"), col("months_to_add")))

It gives me this error:

'Column' object is not callable

Is there a way to achieve this without running a SQL query on top of the DataFrame?

anurag asked Aug 10 '17 11:08


1 Answer

I'd use expr, which lets add_months take the number of months from a column:

from pyspark.sql.functions import expr, to_date

df = spark.createDataFrame(
    [("2015-06-23", 5), ("2016-07-20", 7)],
    ("data_date", "months_to_add")
).select(to_date("data_date").alias("data_date"), "months_to_add")

df.withColumn("new_data_date", expr("add_months(data_date, months_to_add)")).show()

+----------+-------------+-------------+
| data_date|months_to_add|new_data_date|
+----------+-------------+-------------+
|2015-06-23|            5|   2015-11-23|
|2016-07-20|            7|   2017-02-20|
+----------+-------------+-------------+
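
To sanity-check the dates in that output without a Spark session, here is a minimal plain-Python sketch of the month arithmetic. The helper name add_months_sketch is hypothetical, and it is not Spark's implementation: it assumes the day of month is kept when possible and clamped to the last day of a shorter target month. Spark's own end-of-month handling may differ between versions, so treat that rule as an assumption.

```python
import calendar
from datetime import date


def add_months_sketch(start: date, months: int) -> date:
    """Shift `start` by `months` calendar months, clamping the day if needed.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Count months from year 0 so year rollover becomes integer division.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    # Assumption: clamp to the end of the month when the day does not exist.
    return date(year, month, min(start.day, last_day))


# Same rows as the DataFrame in the answer above.
rows = [(date(2015, 6, 23), 5), (date(2016, 7, 20), 7)]
for data_date, months_to_add in rows:
    print(data_date, months_to_add, add_months_sketch(data_date, months_to_add))
```

For the two sample rows this gives 2015-11-23 and 2017-02-20, matching the new_data_date column shown above. Negative values of months move the date backwards.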
Alper t. Turker answered Oct 20 '22 08:10