Date and Interval Addition in SparkSQL

I am trying to execute a simple SQL query on a DataFrame in spark-shell. The query adds an interval of 1 week to a date, as follows:

The original query:

scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()

Now when I did some tests:

scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()

I got the correct result:

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1999-09-26|
+----------------------------------------------------------------------------+

(Just adding 7 days: 19 + 7 = 26.)

But when I just changed the year from 1999 to 1997, the result changed!

scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1997-09-25|
+----------------------------------------------------------------------------+

Why did the result change? Shouldn't it be 26, not 25?

So, is this a bug in Spark SQL related to some kind of intermediate calculation loss, or am I missing something?
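
For reference, table1 can be reproduced in spark-shell with something like the following (storing date2 as a string column is just an assumption here):

scala> // Hypothetical minimal setup: register two sample dates as table1
scala> Seq("1999-09-19", "1997-09-19").toDF("date2").createOrReplaceTempView("table1")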

asked Jul 28 '17 by yakout


2 Answers

This is probably a matter of conversions to local time. INTERVAL casts data to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]

(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
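
One way to test the local-time hypothesis is to pin the session time zone (configurable via spark.sql.session.timeZone since Spark 2.2) and re-run the query; this is only a sketch of the check, not something verified here:

scala> // Force UTC so the intermediate TIMESTAMP round-trip cannot shift the day
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week").show()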

DATE_ADD should exhibit more stable behavior:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]
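
Applied to the original query from the question, that would look roughly like this (assuming table1.date2 as described there):

scala> spark.sql("SELECT DATE_ADD(CAST(date2 AS DATE), 7) FROM table1").show()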
answered Sep 23 '22 by Alper t. Turker

As of Spark 3, this bug has been fixed. Let's create a DataFrame with the dates you mentioned and add a week interval. First, create the DataFrame:

import java.sql.Date
import org.apache.spark.sql.functions.expr // needed for expr() below
// spark.implicits._ is already in scope in spark-shell; import it explicitly in an application

val df = Seq(
  Date.valueOf("1999-09-19"),
  Date.valueOf("1997-09-19")
).toDF("some_date")

Add a week interval:

df
  .withColumn("plus_one_week", expr("some_date + INTERVAL 1 week"))
  .show()
+----------+-------------+
| some_date|plus_one_week|
+----------+-------------+
|1999-09-19|   1999-09-26|
|1997-09-19|   1997-09-26|
+----------+-------------+

You can also get this same result with the make_interval() SQL function:

df
  .withColumn("plus_one_week", expr("some_date + make_interval(0, 0, 1, 0, 0, 0, 0)"))
  .show()

We're working on getting make_interval() exposed as Scala/PySpark functions, so it won't be necessary to use expr to access the function.

date_add only works for adding days, so it's limited. make_interval() is a lot more powerful because it lets you add any combination of years / months / weeks / days / hours / minutes / seconds.
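
For example, here is a sketch that adds 1 year, 2 months, and 3 days in a single expression (the argument order is years, months, weeks, days, hours, mins, secs; the column name plus_mixed_interval is just illustrative):

df
  .withColumn("plus_mixed_interval", expr("some_date + make_interval(1, 2, 0, 3, 0, 0, 0)"))
  .show()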

answered Sep 22 '22 by Powers