Date and Interval Addition in SparkSQL

I am trying to execute a simple SQL query on a DataFrame in spark-shell. The query adds an interval of 1 week to a date, as follows:

The original query:

scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()

Now when I did some tests:

scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()

I got the correct result:

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1999-09-26|
+----------------------------------------------------------------------------+

(Just adding 7 days: 19 + 7 = 26.)

But when I just changed the year from 1999 to 1997, the result changed!

scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1997-09-25|
+----------------------------------------------------------------------------+

Why did the result change? Shouldn't it be 26, not 25?

So, is this a bug in Spark SQL related to some kind of intermediate calculation loss, or am I missing something?
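
For reference, table1 can be reproduced in spark-shell with something like the following (storing date2 as a string column is just an assumption here):

scala> // Hypothetical minimal setup: register two sample dates as table1
scala> Seq("1999-09-19", "1997-09-19").toDF("date2").createOrReplaceTempView("table1")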

asked Jul 28 '17 by yakout


2 Answers

This is probably a matter of conversions to local time. INTERVAL casts data to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]

(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
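
One way to test the local-time hypothesis is to pin the session time zone (configurable via spark.sql.session.timeZone since Spark 2.2) and re-run the query; this is only a sketch of the check, not something verified here:

scala> // Force UTC so the intermediate TIMESTAMP round-trip cannot shift the day
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week").show()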

DATE_ADD should exhibit more stable behavior:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]
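
Applied to the original query from the question, that would look roughly like this (assuming table1.date2 as described there):

scala> spark.sql("SELECT DATE_ADD(CAST(date2 AS DATE), 7) FROM table1").show()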
answered Sep 23 '22 by Alper t. Turker

As of Spark 3, this bug has been fixed. Let's create a DataFrame with the dates you mentioned and add a week interval. First, create the DataFrame:

import java.sql.Date
import org.apache.spark.sql.functions.expr // needed for expr() below
// spark.implicits._ is already in scope in spark-shell; import it explicitly in an application

val df = Seq(
  Date.valueOf("1999-09-19"),
  Date.valueOf("1997-09-19")
).toDF("some_date")

Add a week interval:

df
  .withColumn("plus_one_week", expr("some_date + INTERVAL 1 week"))
  .show()
+----------+-------------+
| some_date|plus_one_week|
+----------+-------------+
|1999-09-19|   1999-09-26|
|1997-09-19|   1997-09-26|
+----------+-------------+

You can also get this same result with the make_interval() SQL function:

df
  .withColumn("plus_one_week", expr("some_date + make_interval(0, 0, 1, 0, 0, 0, 0)"))
  .show()

We're working on getting make_interval() exposed as Scala/PySpark functions, so it won't be necessary to use expr to access the function.

date_add only works for adding days, so it's limited. make_interval() is a lot more powerful because it lets you add any combination of years / months / weeks / days / hours / minutes / seconds.
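
For example, here is a sketch that adds 1 year, 2 months, and 3 days in a single expression (the argument order is years, months, weeks, days, hours, mins, secs; the column name plus_mixed_interval is just illustrative):

df
  .withColumn("plus_mixed_interval", expr("some_date + make_interval(1, 2, 0, 3, 0, 0, 0)"))
  .show()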

answered Sep 22 '22 by Powers