Collapse multiple rows having contiguous timestamps

Tags: sql, postgresql

I have a table that logs usage of a product. Each use is recorded as a row with a timestamp; I want to represent the same data as time ranges.

It looks like this (PostgreSQL 9.1):

userid | timestamp          | product
-------------------------------------
001    | 2012-04-23 9:12:05 | foo
001    | 2012-04-23 9:12:07 | foo
001    | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | barbaz
001    | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | barbaz
001    | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | foo
002    | 2012-04-24 3:41:03 | foo

I want to collapse rows whose time difference from the previous row is less than a delta (say, 2 seconds), and get the begin and end time of each run, like this:

userid | begin              | end                | product
----------------------------------------------------------
001    | 2012-04-23 9:12:05 | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | 2012-04-24 3:41:03 | foo

Please note that consecutive usage of the same product is split into two rows if their usage is more than delta (2 seconds, in this example) apart.

create table t (userid int, timestamp timestamp, product text);

insert into t (userid, timestamp, product) values 
(001, '2012-04-23 9:12:05', 'foo'),
(001, '2012-04-23 9:12:07', 'foo'),
(001, '2012-04-23 9:12:09', 'foo'),
(001, '2012-04-23 9:12:11', 'barbaz'),
(001, '2012-04-23 9:12:13', 'barbaz'),
(001, '2012-04-23 9:15:00', 'barbaz'),
(001, '2012-04-23 9:15:01', 'barbaz'),
(002, '2012-04-24 3:41:01', 'foo'),
(002, '2012-04-24 3:41:03', 'foo')
;
asked Jun 25 '12 by Giuseppe Cardone

1 Answer

Inspired by this answer, given a while back by @a_horse_with_no_name.

WITH groupped_t AS (
SELECT *, sum(grp_id) OVER (ORDER BY userid,product,"timestamp") AS grp_nr
  FROM (SELECT t.*,
          lag("timestamp") OVER
           (PARTITION BY userid,product ORDER BY "timestamp") AS prev_ts,
          CASE WHEN ("timestamp" - lag("timestamp") OVER
            (PARTITION BY userid,product ORDER BY "timestamp")) <= '2 seconds'::interval
          THEN NULL ELSE 1 END AS grp_id
        FROM t) AS g
), periods AS (
SELECT min(gt."timestamp") AS grp_min, max(gt."timestamp") AS grp_max, grp_nr
  FROM groupped_t AS gt
 GROUP BY gt.grp_nr
)
SELECT gt.userid, p.grp_min AS "begin", p.grp_max AS "end", gt.product
  FROM periods p
  JOIN groupped_t gt ON gt.grp_nr = p.grp_nr AND gt."timestamp" = p.grp_min
 ORDER BY gt.userid, p.grp_min;
  1. The innermost query assigns grouping IDs based on userid, product, and the time difference to the previous row. I assume it is safe to PARTITION BY the first two fields here.
  2. groupped_t returns all the source columns plus an extra running group number. Only ORDER BY (no PARTITION BY) is used for the sum() window function, because the group IDs must be unique across the whole result.
  3. periods is a helper query that finds the first and last timestamp in each group.
  4. Finally, groupped_t is joined with periods on grp_nr (which is why it must be unique) and on the timestamp of the first entry in each group.
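Under the same assumptions (the table t from the question, a 2-second delta), the four steps above can also be collapsed into a single gaps-and-islands query. This is a sketch of that variant, not part of the original answer:

```sql
-- Sketch: mark the first row of each period, number periods with a running
-- count (count() ignores NULLs), then aggregate per period. The final
-- GROUP BY replaces the self-join on the first timestamp.
SELECT userid,
       min("timestamp") AS "begin",
       max("timestamp") AS "end",
       product
  FROM (SELECT *,
               count(is_new_period) OVER
                 (PARTITION BY userid, product ORDER BY "timestamp") AS period_nr
          FROM (SELECT t.*,
                       CASE WHEN "timestamp" - lag("timestamp") OVER
                              (PARTITION BY userid, product ORDER BY "timestamp")
                              <= '2 seconds'::interval
                       THEN NULL ELSE 1 END AS is_new_period
                  FROM t) AS marked) AS numbered
 GROUP BY userid, product, period_nr
 ORDER BY userid, "begin";
```

Because period_nr only needs to be unique within each (userid, product) partition, this version can use PARTITION BY in the running count as well.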

You can also check this query on SQL Fiddle.

Note that timestamp, begin and end are reserved words in standard SQL (end is also reserved in PostgreSQL), so you should either avoid them or double-quote them.
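A minimal illustration of that double-quoting (a hypothetical snippet, not part of the original answer):

```sql
-- Double quotes turn reserved words into ordinary identifiers,
-- which is why the query above writes "timestamp", "begin" and "end".
SELECT min("timestamp") AS "begin",
       max("timestamp") AS "end"
  FROM t;
```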

answered Sep 28 '22 by vyegorov