I have a Spark SQL DataFrame with date data, and what I'm trying to get is all the rows preceding the current row within a given date range. For example, I want all the rows from 7 days back preceding the given row. I figured out I need to use a Window function like:
```python
Window \
    .partitionBy('id') \
    .orderBy('start')
```
and here comes the problem: I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
```python
.rowsBetween(-sys.maxsize, 0)
```
but I would like to achieve something like:
```python
.rangeBetween("7 days", 0)
```
If anyone could help me on this one I'd be very grateful. Thanks in advance!
Spark >= 2.3
Since Spark 2.3 it has been possible to use interval objects with the SQL API, but DataFrame API support is still a work in progress.
```python
df.createOrReplaceTempView("df")

spark.sql(
    """SELECT *, mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS mean FROM df""").show()

## +---+----------+----------+------------------+
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+
```
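If you'd rather not create a temp view, one workaround is to embed the same window clause in `expr()`. This is only a sketch, not a dedicated DataFrame-level frame API; it simply routes the SQL fragment through the parser, so it should behave like the raw query above:

```python
from pyspark.sql.functions import expr

# Sketch: the interval frame is handled by the SQL parser inside expr(),
# so this should produce the same result as the raw SQL query above.
df.withColumn("mean", expr("""
    mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    )""")).show()
```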
Spark < 2.3
As far as I know it is not possible directly, in either Spark or Hive. Both require the `ORDER BY` clause used with `RANGE` to be numeric. The closest thing I found is converting to timestamp and operating on seconds. Assuming the `start` column contains a `date` type:
```python
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
```
A small helper and window definition:
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col

# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
```
Finally, the query:
```python
w = (Window()
    .partitionBy(col("id"))
    .orderBy(col("start").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+
```
Far from pretty but works.
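If you want to convince yourself that the frame boundary really is expressed in seconds, you can inspect the long values the window orders by (a quick check against the `df` defined above):

```python
# date -> timestamp -> long yields seconds since the Unix epoch,
# which is why 7 days is written as -days(7) = -7 * 86400
df.select(
    col("start"),
    col("start").cast("timestamp").cast("long").alias("epoch_seconds")
).show()
```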
* Hive Language Manual, Types
Fantastic solution @zero323. If you want to operate with minutes instead of days, as I had to, and you don't need to partition by id, you only have to modify a small part of the code, as I show:
```python
df.createOrReplaceTempView("df")

spark.sql(
    """SELECT *, sum(total) OVER (
        ORDER BY CAST(reading_date AS timestamp)
        RANGE BETWEEN INTERVAL 45 minutes PRECEDING AND CURRENT ROW
     ) AS sum_total FROM df""").show()
```
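The same adjustment works for the pre-2.3 DataFrame workaround: swap the `days` helper for a `minutes` one. A sketch, assuming a `df` with `reading_date` and `total` columns as in the query above:

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum as sum_

# Frame boundaries are in seconds, so 45 minutes = 45 * 60
minutes = lambda i: i * 60

# No partitionBy here: one global window ordered by reading_date
w = (Window()
    .orderBy(col("reading_date").cast("timestamp").cast("long"))
    .rangeBetween(-minutes(45), 0))

df.withColumn("sum_total", sum_("total").over(w)).show()
```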