Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL: finding longest date gap

Tags:

sql

I have a table with 2 fields: unique ID, user ID (foreign key) and date-time. This is an access-log to a service. I work in SQL Server but I would appreciate agnostic answers.

I would like using SQL to find for a certain user the ID from where the longest gap begins.

So for example, say my values are as follows (simplification for one user):

ID |  User-ID |  Time
----------------------------------
1  |  1       |  11-MAR-09, 8:00am
2  |  1       |  11-MAR-09, 6:00pm
3  |  1       |  13-MAR-09, 7:00pm
4  |  1       |  14-MAR-09, 6:00pm

If I search for the longest gap for user 1 I will get ID 2 (it would also be nice to get the length of the gap right there and then, but much less critical).

What's the most efficient way to achieve this in SQL?

Note: ID is not necessarily sequential.

Thank you

like image 401
Roee Adler Avatar asked Aug 22 '09 05:08

Roee Adler


2 Answers

Database-agnostic, something of a variant of richardtallent's, but without the restrictions. (I'm using SQL Server 2008 here, but it shouldn't matter.)

Starting with this setup:

create table test(id int, userid int, time datetime)
insert into test values (1, 1, '2009-03-11 08:00')
insert into test values (2, 1, '2009-03-11 18:00')
insert into test values (3, 1, '2009-03-13 19:00')
insert into test values (4, 1, '2009-03-14 18:00')

Running this query:

select 
  starttime.id as gapid, starttime.time as starttime, endtime.time as endtime, 
  /* Replace next line with your DB's way of calculating the gap */
  DATEDIFF(second, starttime.time, endtime.time) as gap
from 
  test as starttime
inner join test as endtime on 
  (starttime.userid = endtime.userid) 
  and (starttime.time < endtime.time) 
left join test as intermediatetime on 
  (starttime.userid = intermediatetime.userid) 
  and (starttime.time < intermediatetime.time) 
  and (intermediatetime.time < endtime.time) 
where 
  (intermediatetime.id is null)

Gives the following:

gapid  starttime                endtime                  gap
1      2009-03-11 08:00:00.000  2009-03-11 18:00:00.000  36000
2      2009-03-11 18:00:00.000  2009-03-13 19:00:00.000  176400
3      2009-03-13 19:00:00.000  2009-03-14 18:00:00.000  82800

You can then just ORDER BY the gap expression descending, and pick the top result.

Some explanation:

  • Like richardtallent's answer, you join the table onto itself to find a 'later' record – this basically pairs all records with ANY of their later records, here pairing {1+2, 1+3, 1+4, 2+3, 2+4, 3+4}.
  • Then there's another self-join, this time a left join, to find rows in between the two previously selected so {1+2+null, 1+3+2, 1+4+2, 1+4+3, 2+3+null, 2+4+3, 3+4+null}.
  • The WHERE clause, though, filters these out (keeps only the rows with no intermediate row), hence keeping only {1+2+null, 2+3+null, 3+4+null}. Taa-daa!

If you could, potentially, have the same time in there twice (a 'gap' of 0) then you'll need a way to break ties, as Dems points out. If you can use ID as a tie-breaker, then change e.g.

and (starttime.time < intermediatetime.time) 

to

and ((starttime.time < intermediatetime.time) 
  or ((starttime.time = intermediatetime.time) and (starttime.id < intermediatetime.id)))

assuming that 'id' is a valid way to break ties.

In fact, if you know that ID will be monotonically increasing (I know you said 'not sequential,' but it's not clear if this means that they don't increase with each row, or just that the IDs of the two relevant entries may not be sequential because e.g. another user has entries in between), you can use ID instead of time in all the comparisons to make this even simpler.

like image 61
Cowan Avatar answered Oct 19 '22 06:10

Cowan


Join ranked Time on one-off rank to get the gap:

with cte_ranked as (
select *, row_number() over (partition by UserId order by Time) as rn
from table)
select l.*, datediff(minute, r.Time, l.Time) as gap_length
from cte_ranked l join cte_ranked r on l.UserId = r.UserId and l.rn = r.rn-1

You can then use many methods to identify the maximum gap, when it started etc.

Update

My original answer was written from a Mac w/o a database to test with. I had some more time to play with this problem and actually test and measure how it performs on a 1M records table. My test table is defined like this:

create table access (id int identity(1,1)
    , UserId int not null
    , Time datetime not null);
create clustered index cdx_access on access(UserID, Time);
go

For selecting the record for any information, my preferred answer so far is this:

with cte_gap as (
    select Id, UserId, a.Time, (a.Time - prev.Time) as gap
    from access a
    cross apply (
        select top(1) Time 
        from access b
        where a.UserId = b.UserId
            and a.Time > b.Time
        order by Time desc) as prev)
, cte_max_gap as (
    select UserId, max(gap) as max_gap
    from cte_gap
    group by UserId)
select g.* 
    from cte_gap g
    join cte_max_gap m on m.UserId = g.UserId and m.max_gap = g.gap
where g.UserId = 42;

From 1M record, ~47k distinct users, the result for this is returned in 1ms on my test puny instance (warm cache), 48 page reads.

If the UserId=42 filter is removed the max gap and time it occurred for every user (with duplicates for multiple max gaps) need 6379139 reads, quite heavy, and takes 14s on my test machine.

The time can be cut in half if only the UserId and max gap is needed (no info when the max gap occurred):

select UserId, max(a.Time-prev.Time) as gap
    from access a
    cross apply (
        select top(1) Time 
        from access b
        where a.UserId = b.UserId
            and a.Time > b.Time
        order by Time desc
    ) as prev
group by UserId

This only needs 3193448 reads, only half compared to previous, and completed in 6 seconds on 1M records. The difference occurs because the previous version needed to evaluate every gap once to find the max one, then evaluate them again to find the ones that equal with the max. Note that for this performance results the structure of the table I proposed with an index on (UserId, Time) is critical.

As for the use of CTEs and 'partitions' (better known as ranking functions): this is all ANSI SQL-99 and is supported by most vendors. The only SQL Server specific construct was the use of the datediff function, which is now removed. I have a feeling some readers understand 'agnostic' as 'least common denominator SQL understood also by my favorite vendor'. Also note that the use of common table expressions and cross apply operator are used solely to improve the readability of the query. Both can be replaced with derived table using a simple, mechanical, replacement. Here is the very same query where the CTEs where replaced with derived tables. I'll let you judge yourselves on its readability compared with the CTE based one:

select g.*
    from (    
        select Id, UserId, a.Time, (a.Time - (
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc
        )) as gap
        from access a) as g
    join (
        select UserId, max(gap) as max_gap
            from (
                select Id, UserId, a.Time, (a.Time - (
                   select top(1) Time 
                   from access b
                   where a.UserId = b.UserId
                     and a.Time > b.Time
                   order by Time desc
                   )) as gap
            from access a) as cte_gap
        group by UserId) as m on m.UserId = g.UserId and m.max_gap = g.gap
    where g.UserId = 42

Damn, I was hopping will end up more convoluted lol. This is quite readable because it had only two CTEs to start from. Still, on queries with 5-6 derived tables, the CTE form is way, way more readable.

For completeness, here is the same transformation applied to my simplified query (only max gaps, no gap end time and access id):

select UserId, max(gap)
    from (
        select UserId, a.Time-(
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc) as gap
    from access a) as gaps
group by UserId
like image 27
Remus Rusanu Avatar answered Oct 19 '22 06:10

Remus Rusanu