This is probably easiest to explain through example. Suppose I have a DataFrame of user logins to a website, for instance:
scala> df.show(5)
+----------------+----------+
| user_name|login_date|
+----------------+----------+
|SirChillingtonIV|2012-01-04|
|Booooooo99900098|2012-01-04|
|Booooooo99900098|2012-01-06|
| OprahWinfreyJr|2012-01-10|
|SirChillingtonIV|2012-01-11|
+----------------+----------+
only showing top 5 rows
I would like to add to this a column indicating when they became an active user on the site. But there is one caveat: there is a time period during which a user is considered active, and after this period, if they log in again, their became_active date resets. Suppose this period is 5 days. Then the desired table derived from the above table would be something like this:
+----------------+----------+-------------+
| user_name|login_date|became_active|
+----------------+----------+-------------+
|SirChillingtonIV|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-06| 2012-01-04|
| OprahWinfreyJr|2012-01-10| 2012-01-10|
|SirChillingtonIV|2012-01-11| 2012-01-11|
+----------------+----------+-------------+
So, in particular, SirChillingtonIV's became_active date was reset because their second login came after the active period expired, but Booooooo99900098's became_active date was not reset the second time they logged in, because it fell within the active period.
My initial thought was to use window functions with lag, and then use the lagged values to fill the became_active column; for instance, something starting roughly like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.partitionBy("user_name").orderBy("login_date")
val df2 = df.withColumn("tmp", lag("login_date", 1).over(window))
Then, the rule to fill in the became_active date would be: if tmp is null (i.e., if it's the first ever login) or if login_date - tmp >= 5, then became_active = login_date; otherwise, go to the next most recent value in tmp and apply the same rule. This suggests a recursive approach, which I'm having trouble imagining a way to implement.
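For concreteness, a minimal sketch of a single application of that rule (reusing df2 from above; when, col, and datediff all come from the functions._ import) might look like:

// One application of the rule: the first login (tmp is null) or a gap of
// at least 5 days sets became_active = login_date; every other row stays
// null, because it needs the previous row's answer -- the recursive step.
val oneStep = df2.withColumn("became_active",
  when(col("tmp").isNull || datediff(col("login_date"), col("tmp")) >= 5,
    col("login_date")))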
My questions: Is this a viable approach, and if so, how can I "go back" and look at earlier values of tmp until I find one where I stop? I can't, to my knowledge, iterate through values of a Spark SQL Column. Is there another way to achieve this result?
Spark >= 3.2
Recent Spark releases provide native support for session windows in both batch and structured streaming queries (see SPARK-10816 and its sub-tasks, especially SPARK-34893).
The official documentation provides a nice usage example.
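For illustration, here is a minimal sketch (my own, not taken from the documentation) using the built-in session_window function added in Spark 3.2, assuming login_date can be cast to a timestamp. Note it aggregates to one row per session, so producing a per-login became_active column would still require joining back to the original rows:

import org.apache.spark.sql.functions.{col, min, session_window}

// Group each user's logins into sessions separated by gaps of more
// than 5 days, then take the earliest login date in each session.
val sessions = df
  .withColumn("login_ts", col("login_date").cast("timestamp"))
  .groupBy(col("user_name"), session_window(col("login_ts"), "5 days"))
  .agg(min(col("login_date")).as("became_active"))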
Spark < 3.2
Here is the trick. Import a bunch of functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, datediff, lag, lit, min, sum}
Define windows:
val userWindow = Window.partitionBy("user_name").orderBy("login_date")
val userSessionWindow = Window.partitionBy("user_name", "session")
Find the points where new sessions start:
val newSession = (coalesce(
  datediff($"login_date", lag($"login_date", 1).over(userWindow)),
  lit(0)
) > 5).cast("bigint")

val sessionized = df.withColumn("session", sum(newSession).over(userWindow))
Find the earliest date per session:
val result = sessionized
  .withColumn("became_active", min($"login_date").over(userSessionWindow))
  .drop("session")
With the dataset defined as:
val df = Seq(
  ("SirChillingtonIV", "2012-01-04"),
  ("Booooooo99900098", "2012-01-04"),
  ("Booooooo99900098", "2012-01-06"),
  ("OprahWinfreyJr", "2012-01-10"),
  ("SirChillingtonIV", "2012-01-11"),
  ("SirChillingtonIV", "2012-01-14"),
  ("SirChillingtonIV", "2012-08-11")
).toDF("user_name", "login_date")
The result is:
+----------------+----------+-------------+
|       user_name|login_date|became_active|
+----------------+----------+-------------+
|  OprahWinfreyJr|2012-01-10|   2012-01-10|
|SirChillingtonIV|2012-01-04|   2012-01-04| <- The first session for user
|SirChillingtonIV|2012-01-11|   2012-01-11| <- The second session for user
|SirChillingtonIV|2012-01-14|   2012-01-11|
|SirChillingtonIV|2012-08-11|   2012-08-11| <- The third session for user
|Booooooo99900098|2012-01-04|   2012-01-04|
|Booooooo99900098|2012-01-06|   2012-01-04|
+----------------+----------+-------------+
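Note that the rows come back grouped by window partition rather than in input order; if a deterministic ordering is wanted, the standard orderBy method can be appended, e.g.:

result.orderBy($"user_name", $"login_date").show()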
Refactoring the other answer to work with PySpark
In PySpark you can do it as below.
Create the data frame:
df = sqlContext.createDataFrame(
[
("SirChillingtonIV", "2012-01-04"),
("Booooooo99900098", "2012-01-04"),
("Booooooo99900098", "2012-01-06"),
("OprahWinfreyJr", "2012-01-10"),
("SirChillingtonIV", "2012-01-11"),
("SirChillingtonIV", "2012-01-14"),
("SirChillingtonIV", "2012-08-11")
],
("user_name", "login_date"))
The above code creates a data frame like the one below:
+----------------+----------+
| user_name|login_date|
+----------------+----------+
|SirChillingtonIV|2012-01-04|
|Booooooo99900098|2012-01-04|
|Booooooo99900098|2012-01-06|
| OprahWinfreyJr|2012-01-10|
|SirChillingtonIV|2012-01-11|
|SirChillingtonIV|2012-01-14|
|SirChillingtonIV|2012-08-11|
+----------------+----------+
Now we first want to find out where the difference between consecutive login_date values is more than 5 days.
For this, do the following.
Necessary imports
from pyspark.sql import functions as f
from pyspark.sql import Window
# defining window partitions
login_window = Window.partitionBy("user_name").orderBy("login_date")
session_window = Window.partitionBy("user_name", "session")
session_df = df.withColumn(
    "session",
    f.sum(
        (f.coalesce(
            f.datediff("login_date", f.lag("login_date", 1).over(login_window)),
            f.lit(0)
        ) > 5).cast("int")
    ).over(login_window)
)
When we run the above line of code, if datediff returns NULL (i.e., on the user's first login), the coalesce function replaces the NULL with 0.
+----------------+----------+-------+
| user_name|login_date|session|
+----------------+----------+-------+
| OprahWinfreyJr|2012-01-10| 0|
|SirChillingtonIV|2012-01-04| 0|
|SirChillingtonIV|2012-01-11| 1|
|SirChillingtonIV|2012-01-14| 1|
|SirChillingtonIV|2012-08-11| 2|
|Booooooo99900098|2012-01-04| 0|
|Booooooo99900098|2012-01-06| 0|
+----------------+----------+-------+
# add a became_active column by finding the min login_date over each window
# partitioned by user_name and the session created in the step above
final_df = session_df.withColumn("became_active", f.min("login_date").over(session_window)).drop("session")
+----------------+----------+-------------+
| user_name|login_date|became_active|
+----------------+----------+-------------+
| OprahWinfreyJr|2012-01-10| 2012-01-10|
|SirChillingtonIV|2012-01-04| 2012-01-04|
|SirChillingtonIV|2012-01-11| 2012-01-11|
|SirChillingtonIV|2012-01-14| 2012-01-11|
|SirChillingtonIV|2012-08-11| 2012-08-11|
|Booooooo99900098|2012-01-04| 2012-01-04|
|Booooooo99900098|2012-01-06| 2012-01-04|
+----------------+----------+-------------+