Let's say I've got a list of pageview events, each of which has a session id. For each event I want to add the time and URL of the chronologically first pageview in that event's session. For example, lets say my events are in table test
which looks like this:
uid | session_id | timestamp | url
----------------------------------------------------
u1 0 0 a.com/
u1 1 1 a.com/p1
u1 1 2 a.com/p2
I want a SQL command that produces the following:
uid | session_id | timestamp | url | s_timestamp | s_url
---------------------------------------------------------------------
u1 0 0 a.com/ 0 a.com/
u1 1 1 a.com/p1 1 a.com/p1
u1 1 2 a.com/p2 1 a.com/p1
Window functions seem to be the way to go here, but I'm pretty new to them. The following statement produces the desired table, but I'm wondering if it's suboptimal
SELECT
uid,
session_id,
timestamp,
url,
first_value(url) OVER (PARTITION BY uid, session_id ORDER BY timestamp ASC) s_url,
first_value(timestamp) OVER (PARTITION BY uid, session_id ORDER BY timestamp ASC) s_timestamp
FROM test
In particular, it seems wrong that I use the OVER
clause twice. Is there a way to select the chronologically first URL and timestamp from a session using a single OVER clause? I'm using SPARK SQL, but I will accept any answer that is applicable to multiple major SQL systems.
A query can contain multiple window functions that slice up the data in different ways using different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
This order of operations implies that you can only use window functions in SELECT and ORDER BY . That is, window functions are not accessible in WHERE , GROUP BY , or HAVING clauses. For this reason, you cannot use any of these functions in WHERE : ROW_NUMBER() , RANK() , DENSE_RANK() , LEAD() , LAG() , or NTILE() .
The reason why window functions are not allowed in GROUP BY is the order of operations in SQL. The clauses of a SQL query are processed in a different order than they are written in the query.
It's possible to use the WINDOW
keyword to name a window that can then be references in the SELECT clause:
SELECT
uid,
session_id,
timestamp,
url,
first_value(url) OVER w s_url,
first_value(timestamp) OVER w s_timestamp
FROM test
WINDOW w AS (PARTITION BY uid, session_id ORDER BY timestamp ASC)
This works in Apache Spark SQL and HiveQL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With