We are logging the main flows of actions our users make on our iPad app on a table. Each flow has a start(tagged Started) and an end that is either tagged Cancelled or Finished, and there shouldn't be any overlapping events.
A set of flows Started, Cancelled or Finished for a user looks like this:
user_id timestamp event_text event_num
[email protected] 2016-10-30 00:08:00.966+00 Flow Started 0
[email protected] 2016-10-30 00:08:15.58+00 Flow Cancelled 2
[email protected] 2016-10-30 00:08:15.581+00 Flow Started 0
[email protected] 2016-10-30 00:34:44.134+00 Flow Finished 1
[email protected] 2016-10-30 00:42:26.102+00 Flow Started 0
[email protected] 2016-10-30 00:42:49.276+00 Flow Cancelled 2
[email protected] 2016-10-30 00:42:49.277+00 Flow Started 0
[email protected] 2016-10-30 00:59:47.337+00 Flow Cancelled 2
[email protected] 2016-10-30 00:59:47.337+00 Flow Started 0
[email protected] 2016-10-30 00:59:47.928+00 Flow Cancelled 2
We want to calculate how long a cancelled and finished flow last on average. For this we need to pair event Started with Canceled or Finished. The following code does that, however can't work around the following data quality issue that we have:
When a customer wants to start a new flow(let's call it Flow2) before ending the ongoing flow (Flow1), we shoot a cancelled event as we shoot the started event for the new flow. So Flow1 Cancelled=Flow2 Started
. However when we use window functions to order and lead/lag between ordered events that actually belong to different flows get matched.
By using this code:
WITH track_scf AS (SELECT user_id, timestamp, event_text, CASE WHEN event_text LIKE '%Started%' THEN 0 when event_text like '%Cancelled%' then 2 ELSE 1 END AS event_num FROM tracks ORDER BY 2, 4 desc ) SELECT user_id, CASE WHEN event_num=0 then timestamp end as start,CASE WHEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) <> 0 THEN LEAD(timestamp, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) END as end, CASE WHEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) <> 0 THEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) END as action FROM track_scf
We get this result:
user_id start end action
[email protected] 2016-10-30 00:08:00.966+00 2016-10-30 00:08:15.58+00 2
[email protected] 2016-10-30 00:08:15.581+00 2016-10-30 00:34:44.134+00 1
[email protected] 2016-10-30 00:42:26.102+00 2016-10-30 00:42:49.276+00 2
[email protected] 2016-10-30 00:42:49.277+00 NULL NULL
[email protected] 2016-10-30 00:59:47.337+00 2016-10-30 00:59:47.337+00 2
[email protected] NULL 2016-10-30 00:59:47.928+00 2
But we should get this:
user_id start end action
[email protected] 2016-10-30 00:08:00.966+00 2016-10-30 00:08:15.58+00 2
[email protected] 2016-10-30 00:08:15.581+00 2016-10-30 00:34:44.134+00 1
[email protected] 2016-10-30 00:42:26.102+00 2016-10-30 00:42:49.276+00 2
[email protected] 2016-10-30 00:42:49.277+00 2016-10-30 00:59:47.337+00 2
[email protected] 2016-10-30 00:59:47.337+00 2016-10-30 00:59:47.928+00 2
How do I need to alter the code so that the pairing is correct?
select user_id
,"start"
,"end"
,"action"
from (select user_id
,timestamp as "start"
,lead (event_num) over w as "action"
,lead ("timestamp") over w as "end"
,event_num
from tracks t
window w as (partition by user_id order by "timestamp",event_num desc)
) t
where t.event_num = 0
;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With