I am trying to execute a simple SQL query on a DataFrame in spark-shell. The query adds an interval of 1 week to a date, as follows:
The original query:
scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()
Then I ran some tests with a literal date:
scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()
and I got the correct result:
+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
| 1999-09-26|
+----------------------------------------------------------------------------+
(just adding 7 days: 19 + 7 = 26)
But when I changed the year from 1999 to 1997, the result changed!
scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()
+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
| 1997-09-25|
+----------------------------------------------------------------------------+
Why did the result change? Shouldn't it be 26, not 25?
So, is this a bug in Spark SQL related to some kind of intermediate-calculation loss, or am I missing something?
This is probably a matter of conversion to local time. INTERVAL casts the data to TIMESTAMP and then back to DATE:
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]
(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
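If you want to confirm the local-time hypothesis, one quick check (assuming Spark 2.2 or later, where the spark.sql.session.timeZone configuration is available) is to pin the session time zone to UTC and re-run the query; if the off-by-one-day result disappears, the shift came from a DST transition in your local zone:
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week").show()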
DATE_ADD should exhibit more stable behavior:
scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]
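For instance, running DATE_ADD with show() instead of explain (a small spark-shell check; plus_7_days is just an alias for readability) stays entirely in the DATE domain, so it should return 1997-09-26 and 1999-09-26 for the two dates in question:
scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7) AS plus_7_days").show()
scala> spark.sql("SELECT DATE_ADD(CAST('1999-09-19' AS DATE), 7) AS plus_7_days").show()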
As of Spark 3, this bug has been fixed. Let's create a DataFrame with the dates you mentioned and add a week interval. First, create the DataFrame:
import java.sql.Date
import org.apache.spark.sql.functions.expr

val df = Seq(
  Date.valueOf("1999-09-19"),
  Date.valueOf("1997-09-19")
).toDF("some_date")
Add a week interval:
df
.withColumn("plus_one_week", expr("some_date + INTERVAL 1 week"))
.show()
+----------+-------------+
| some_date|plus_one_week|
+----------+-------------+
|1999-09-19| 1999-09-26|
|1997-09-19| 1997-09-26|
+----------+-------------+
You can also get the same result with the make_interval() SQL function:
df
.withColumn("plus_one_week", expr("some_date + make_interval(0, 0, 1, 0, 0, 0, 0)"))
.show()
We're working on getting make_interval() exposed as Scala/PySpark functions, so it won't be necessary to use expr to access the function.
date_add only works for adding days, so it's limited. make_interval() is a lot more powerful because it lets you add any combination of years / months / weeks / days / hours / minutes / seconds.
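For example (a hypothetical column name, plus_mixed; make_interval takes its arguments in the order years, months, weeks, days, hours, mins, secs), you can add a mixed interval in a single call:
df
  // adds 1 year, 2 months, and 3 days to some_date
  .withColumn("plus_mixed", expr("some_date + make_interval(1, 2, 0, 3, 0, 0, 0)"))
  .show()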