Finding gaps in huge event streams?

I have about 1 million events in a PostgreSQL database that are of this format:

id        |   stream_id     |  timestamp
----------+-----------------+-----------------
1         |   7             |  ....
2         |   8             |  ....

There are about 50,000 unique streams.

I need to find all pairs of consecutive events where the time between the two events exceeds a certain period. In other words, I need to find event pairs that have no other event between them for that period of time.

For example:

a b c d   e     f              g         h   i  j k
| | | |   |     |              |         |   |  | | 

                \____2 mins____/

In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.

I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.

I also have the data in MongoDB.

What's the best way to perform this query?

asked Jun 03 '15 04:06 by MikeC8


2 Answers

You can do this with the lag() window function over a partition by stream_id, ordered by timestamp. The lag() function gives you access to an earlier row in the partition; when no offset argument is given, it defaults to the immediately preceding row. So if the partition on stream_id is ordered by time, the previous row is the previous event for that stream_id.

-- Window functions cannot be used in WHERE, so compute the gap in a subquery first.
SELECT *
FROM (
    SELECT stream_id, lag(id) OVER pair AS start_id, id AS end_id,
           ("timestamp" - lag("timestamp") OVER pair) AS diff
    FROM my_table
    WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps
WHERE diff > interval '2 minutes';
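
To make the behaviour of lag() over a per-stream partition concrete, here is a rough plain-Python sketch of the same idea: group events by stream, sort each group by time, and compare every event with its predecessor. The function name find_gaps and the sample events are hypothetical and only there for illustration; this is not a substitute for running the query in the database.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_gaps(events, min_gap):
    """Return (stream_id, start_id, end_id, diff) for consecutive events
    of the same stream that are more than min_gap apart."""
    by_stream = defaultdict(list)
    for event_id, stream_id, ts in events:  # PARTITION BY stream_id
        by_stream[stream_id].append((ts, event_id))

    gaps = []
    for stream_id, rows in by_stream.items():
        rows.sort()  # ORDER BY "timestamp" within the partition
        # zip(rows, rows[1:]) pairs each row with its predecessor, like lag()
        for (prev_ts, prev_id), (ts, event_id) in zip(rows, rows[1:]):
            diff = ts - prev_ts
            if diff > min_gap:
                gaps.append((stream_id, prev_id, event_id, diff))
    return gaps


# Hypothetical sample data: (id, stream_id, timestamp)
events = [
    (1, 7, datetime(2015, 6, 1, 12, 0, 0)),
    (2, 8, datetime(2015, 6, 1, 12, 0, 10)),
    (3, 7, datetime(2015, 6, 1, 12, 0, 30)),
    (4, 7, datetime(2015, 6, 1, 12, 5, 0)),
    (5, 8, datetime(2015, 6, 1, 12, 1, 0)),
]

for gap in find_gaps(events, timedelta(minutes=2)):
    print(gap)
```

The first event of each stream has no predecessor, so it never produces a pair; in the SQL version the same thing happens because lag() returns NULL there and the comparison filters the row out.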

answered Nov 04 '22 11:11 by Patrick


In PostgreSQL this can be done very easily with the help of the lag() window function. See the fiddle below for an example:

SQL Fiddle

PostgreSQL 9.3 Schema Setup:

CREATE TABLE Table1
    ("id" int, "stream_id" int, "timestamp" timestamp)
;

INSERT INTO Table1
    ("id", "stream_id", "timestamp")
VALUES
    (1, 7, '2015-06-01 15:20:30'),
    (2, 7, '2015-06-01 15:20:31'),
    (3, 7, '2015-06-01 15:20:32'),
    (4, 7, '2015-06-01 15:25:30'),
    (5, 7, '2015-06-01 15:25:31')
;

Query 1:

with c as (select *,
           lag("timestamp") over(partition by stream_id order by "timestamp") as pre_time,
           lag(id) over(partition by stream_id order by "timestamp") as pre_id
           from Table1
          )
select * from c where "timestamp" - pre_time > interval '2 sec'

Results:

| id | stream_id |              timestamp |               pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
|  4 |         7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 |      3 |
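
If no PostgreSQL server is at hand, the fiddle can be approximated with Python's built-in sqlite3 module, as in the sketch below. Several things here are assumptions rather than part of the original answer: SQLite has no interval type, so the gap is compared in whole seconds via strftime('%s', ...); window functions need SQLite 3.25 or newer; the sketch orders each partition by "timestamp"; and the helper name run_fiddle_query is made up.

```python
import sqlite3

# Same sample rows as the schema setup above: (id, stream_id, timestamp)
ROWS = [
    (1, 7, "2015-06-01 15:20:30"),
    (2, 7, "2015-06-01 15:20:31"),
    (3, 7, "2015-06-01 15:20:32"),
    (4, 7, "2015-06-01 15:25:30"),
    (5, 7, "2015-06-01 15:25:31"),
]

# SQLite port of the query: timestamps are converted to epoch seconds
# because SQLite cannot subtract timestamps into an interval.
QUERY = """
WITH c AS (
    SELECT id, stream_id, "timestamp",
           lag("timestamp") OVER (PARTITION BY stream_id ORDER BY "timestamp") AS pre_time,
           lag(id)          OVER (PARTITION BY stream_id ORDER BY "timestamp") AS pre_id
    FROM Table1
)
SELECT id, stream_id, "timestamp", pre_time, pre_id
FROM c
WHERE CAST(strftime('%s', "timestamp") AS INTEGER)
    - CAST(strftime('%s', pre_time) AS INTEGER) > :gap_seconds
"""


def run_fiddle_query(rows, gap_seconds):
    """Load rows into an in-memory table and return the rows that follow a gap."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE Table1 ("id" INTEGER, "stream_id" INTEGER, "timestamp" TEXT)'
        )
        conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, {"gap_seconds": gap_seconds}).fetchall()
    finally:
        conn.close()


for row in run_fiddle_query(ROWS, 2):
    print(row)
```

With a 2-second threshold this should single out event 4 together with its predecessor, event 3, matching the result table above.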

answered Nov 04 '22 09:11 by cha