I have the following event table in Postgres 9.3:
CREATE TABLE event (
event_id integer PRIMARY KEY,
user_id integer,
event_type varchar,
event_time timestamptz
);
My goal is to retrieve all user_id's with a gap of at least 30 days between any of their events (or between their last event and the current time). An additional complication is that I only want the users who have one of these gaps occur at a later time than them performing a certain event_type 'convert'. How can this be done easily?
Some example data in the event table might look like:
INSERT INTO event (event_id, user_id, event_type, event_time)
VALUES
(10, 1, 'signIn', '2015-05-05 00:11'),
(11, 1, 'browse', '2015-05-05 00:12'), -- no 'convert' event
(20, 2, 'signIn', '2015-06-07 02:35'),
(21, 2, 'browse', '2015-06-07 02:35'),
(22, 2, 'convert', '2015-06-07 02:36'), -- only 'convert' event
(23, 2, 'signIn', '2015-08-10 11:00'), -- gap of >= 30 days
(24, 2, 'signIn', '2015-08-11 11:00'),
(30, 3, 'convert', '2015-08-07 02:36'), -- starting with 1st 'convert' event
(31, 3, 'signIn', '2015-08-07 02:36'),
(32, 3, 'convert', '2015-08-08 02:36'),
(33, 3, 'signIn', '2015-08-12 11:00'), -- all gaps below 30 days
(33, 3, 'browse', '2015-08-12 11:00'), -- gap until today (2015-08-20) too small
(40, 4, 'convert', '2015-05-07 02:36'),
(41, 4, 'signIn', '2015-05-12 11:00'); -- gap until today (2015-08-20) >= 30 days
Expected result:
user_id
--------
2
4
One way to do it:
SELECT user_id
FROM (
SELECT user_id
, lead(e.event_time, 1, now()) OVER (PARTITION BY e.user_id ORDER BY e.event_time)
- event_time AS gap
FROM ( -- only users with 'convert' event
SELECT user_id, min(event_time) AS first_time
FROM event
WHERE event_type = 'convert'
GROUP BY 1
) e1
JOIN event e USING (user_id)
WHERE e.event_time >= e1.first_time
) sub
WHERE gap >= interval '30 days'
GROUP BY 1;
The window function lead() allows to include a default value if there is no "next row", which is convenient to cover your additional requirement "or between their last event and the current time".
You should at least have an index on (user_id, event_time) if your table is big:
CREATE INDEX event_user_time_idx ON event(user_id, event_time);
If you do that often and the event_type 'convert' is rare, add another partial index:
CREATE INDEX event_user_time_convert_idx ON event(user_id, event_time)
WHERE event_type = 'convert';
And only if gaps of 30 days are common (not a rare case).
Indexes become even more important.
Try this recursive CTE for better performance:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT DISTINCT ON (user_id)
user_id, event_time, interval '0 days' AS gap
FROM event
WHERE event_type = 'convert'
ORDER BY user_id, event_time
)
UNION ALL
SELECT c.user_id, e.event_time, COALESCE(e.event_time, now()) - c.event_time
FROM cte c
LEFT JOIN LATERAL (
SELECT e.event_time
FROM event e
WHERE e.user_id = c.user_id
AND e.event_time > c.event_time
ORDER BY e.event_time
LIMIT 1 -- the next later event
) e ON true -- add 1 row after last to consider gap till "now"
WHERE c.event_time IS NOT NULL
AND c.gap < interval '30 days'
)
SELECT * FROM cte
WHERE gap >= interval '30 days';
It has considerably more overhead, but can stop - per user - at the first gap that's big enough. If that should be the gap between the last event now, then event_time in the result is NULL.
New SQL Fiddle with more revealing test data demonstrating both queries.
Detailed explanation in these related answers:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With