Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I identify groups of consecutive dates in SQL?

Im trying to write a function which identifies groups of dates, and measures the size of the group.

I've been doing this procedurally in Python until now but I'd like to move it into SQL.

for example, the list

Bill 01/01/2011 
Bill 02/01/2011 
Bill 03/01/2011 
Bill 05/01/2011 
Bill 07/01/2011 

should be output into a new table as:

Bill 01/01/2011  3 
Bill 02/01/2011  3 
Bill 03/01/2011  3 
Bill 05/01/2011  1 
Bill 07/01/2011  1

Ideally this should also be able to account for weekends and public holidays - the dates in my table will aways be Mon-Fri (I think I can solve this by making a new table of working days and numbering them in sequence). Someone at work suggested I try a CTE. Im pretty new to this, so I'd appreciate any guidance anyone could provide! Thanks.

like image 896
Pythonn00b Avatar asked Jul 10 '12 14:07

Pythonn00b


People also ask

How do you find consecutive data in SQL?

Using DENSE_RANK Begin by applying the DENSE_RANK function to the rows. To produce the group identifier, we subtract the result of DENSE_RANK from the row value. As the sequence increases, the result of this calculation remains constant but then changes when a new sequence starts.

Can you GROUP BY date in SQL?

To group by date part, use the GROUP BY clause and the EXTRACT() function. Pass EXTRACT() the date parts to isolate.

What is the grouping syntax in SQL?

The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.


2 Answers

You can do this with a clever application of window functions. Consider the following:

select name, date, row_number() over (partition by name order by date)
from t

This adds a row number, which in your example would simply be 1, 2, 3, 4, 5. Now, take the difference from the date, and you have a constant value for the group.

select name, date,
       dateadd(d, - row_number() over (partition by name order by date), date) as val
from t

Finally, you want the number of groups in sequence. I would also add a group identifier (for instance, to distinguish between the last two).

select name, date,
       count(*) over (partition by name, val) as NumInSeq,
       dense_rank() over (partition by name order by val) as SeqID
from (select name, date,
             dateadd(d, - row_number() over (partition by name order by date), date) as val
      from t
     ) t

Somehow, I missed the part about weekdays and holidays. This solution does not solve that problem.

like image 82
Gordon Linoff Avatar answered Sep 28 '22 01:09

Gordon Linoff


The following query account the weekends and holidays. The query has a provision to include the holidays on-the-fly, though for the purpose of making the query clearer, I just materialized the holidays to an actual table.

CREATE TABLE tx
    (n varchar(4), d date);

INSERT INTO tx
    (n, d)
VALUES
    ('Bill', '2006-12-29'), -- Friday    
    -- 2006-12-30 is Saturday
    -- 2006-12-31 is Sunday

    -- 2007-01-01 is New Year's Holiday    
    ('Bill', '2007-01-02'), -- Tuesday
    ('Bill', '2007-01-03'), -- Wednesday
    ('Bill', '2007-01-04'), -- Thursday
    ('Bill', '2007-01-05'), -- Friday    
    -- 2007-01-06 is Saturday
    -- 2007-01-07 is Sunday

    ('Bill', '2007-01-08'), -- Monday
    ('Bill', '2007-01-09'), -- Tuesday

    ('Bill', '2012-07-09'), -- Monday
    ('Bill', '2012-07-10'), -- Tuesday
    ('Bill', '2012-07-11'); -- Wednesday

create table holiday(d date);
insert into holiday(d) values
('2007-01-01');


/* query should return 7 consecutive good 
   attendance(from December 29 2006 to January 9 2007) */    
/* and 3 consecutive attendance from July 7 2012 to July 11 2012. */

Query:

with first_date as
(
    -- get the monday of the earliest date
    select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date 
    from tx 
)
,shifted as
(
    select 
        tx.n, tx.d,                     
        diff = datediff(day, fd.first_date, tx.d) 
                   - (datediff(day, fd.first_date, tx.d)/7 * 2)             
    from tx
    cross join first_date fd
    union
    select 
        xxx.n, h.d,             
        diff = datediff(day, fd.first_date, h.d) 
                   - (datediff(day, fd.first_date, h.d)/7 * 2) 
    from holiday h 
    cross join first_date fd
    cross join (select distinct n from tx) as xxx
)
,grouped as
(
    select *, grp = diff - row_number() over(partition by n order by d)
    from shifted
)
select 
    d, n, dense_rank() over (partition by n order by grp) as nth_streak
    ,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday)  -- remove the holidays

Output:

|          D |    N | NTH_STREAK | STREAK |
-------------------------------------------
| 2006-12-29 | Bill |          1 |      7 |
| 2007-01-02 | Bill |          1 |      7 |
| 2007-01-03 | Bill |          1 |      7 |
| 2007-01-04 | Bill |          1 |      7 |
| 2007-01-05 | Bill |          1 |      7 |
| 2007-01-08 | Bill |          1 |      7 |
| 2007-01-09 | Bill |          1 |      7 |
| 2012-07-09 | Bill |          2 |      3 |
| 2012-07-10 | Bill |          2 |      3 |
| 2012-07-11 | Bill |          2 |      3 |

Live test: http://www.sqlfiddle.com/#!3/815c5/1

The main logic of the query is to shift all the dates two days back. This is done by dividing the date to 7 and multiplying it by two, then subtracting it from the original number. For example, if a given date falls on 15th, this will be computed as 15/7 * 2 == 4; then subtract 4 from the original number, 15 - 4 == 11. 15 will become the 11th day. Likewise the 8th day becomes the 6th day; 8 - (8/7 * 2) == 6.

                  Weekends are not in attendance(e.g. 6,7,13,14)
 1  2  3  4  5     6  7
 8  9 10 11 12    13 14
15

Applying the computation to all the weekday numbers will yield these values:

 1  2  3  4  5    
 6  7  8  9 10    
11

For holidays, you need to slot them on attendance, so to the consecutive-ness could be easily determined, then just remove them from the final query. The above attendance yields 11 consecutive good attendance.

Query logic's detailed explanation here: http://www.ienablemuch.com/2012/07/monitoring-perfect-attendance.html

like image 21
Michael Buen Avatar answered Sep 28 '22 02:09

Michael Buen