Im trying to write a function which identifies groups of dates, and measures the size of the group.
I've been doing this procedurally in Python until now but I'd like to move it into SQL.
for example, the list
Bill 01/01/2011
Bill 02/01/2011
Bill 03/01/2011
Bill 05/01/2011
Bill 07/01/2011
should be output into a new table as:
Bill 01/01/2011 3
Bill 02/01/2011 3
Bill 03/01/2011 3
Bill 05/01/2011 1
Bill 07/01/2011 1
Ideally this should also be able to account for weekends and public holidays - the dates in my table will aways be Mon-Fri (I think I can solve this by making a new table of working days and numbering them in sequence). Someone at work suggested I try a CTE. Im pretty new to this, so I'd appreciate any guidance anyone could provide! Thanks.
Using DENSE_RANK Begin by applying the DENSE_RANK function to the rows. To produce the group identifier, we subtract the result of DENSE_RANK from the row value. As the sequence increases, the result of this calculation remains constant but then changes when a new sequence starts.
To group by date part, use the GROUP BY clause and the EXTRACT() function. Pass EXTRACT() the date parts to isolate.
The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.
You can do this with a clever application of window functions. Consider the following:
select name, date, row_number() over (partition by name order by date)
from t
This adds a row number, which in your example would simply be 1, 2, 3, 4, 5. Now, take the difference from the date, and you have a constant value for the group.
select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
Finally, you want the number of groups in sequence. I would also add a group identifier (for instance, to distinguish between the last two).
select name, date,
count(*) over (partition by name, val) as NumInSeq,
dense_rank() over (partition by name order by val) as SeqID
from (select name, date,
dateadd(d, - row_number() over (partition by name order by date), date) as val
from t
) t
Somehow, I missed the part about weekdays and holidays. This solution does not solve that problem.
The following query account the weekends and holidays. The query has a provision to include the holidays on-the-fly, though for the purpose of making the query clearer, I just materialized the holidays to an actual table.
CREATE TABLE tx
(n varchar(4), d date);
INSERT INTO tx
(n, d)
VALUES
('Bill', '2006-12-29'), -- Friday
-- 2006-12-30 is Saturday
-- 2006-12-31 is Sunday
-- 2007-01-01 is New Year's Holiday
('Bill', '2007-01-02'), -- Tuesday
('Bill', '2007-01-03'), -- Wednesday
('Bill', '2007-01-04'), -- Thursday
('Bill', '2007-01-05'), -- Friday
-- 2007-01-06 is Saturday
-- 2007-01-07 is Sunday
('Bill', '2007-01-08'), -- Monday
('Bill', '2007-01-09'), -- Tuesday
('Bill', '2012-07-09'), -- Monday
('Bill', '2012-07-10'), -- Tuesday
('Bill', '2012-07-11'); -- Wednesday
create table holiday(d date);
insert into holiday(d) values
('2007-01-01');
/* query should return 7 consecutive good
attendance(from December 29 2006 to January 9 2007) */
/* and 3 consecutive attendance from July 7 2012 to July 11 2012. */
Query:
with first_date as
(
-- get the monday of the earliest date
select dateadd( ww, datediff(ww,0,min(d)), 0 ) as first_date
from tx
)
,shifted as
(
select
tx.n, tx.d,
diff = datediff(day, fd.first_date, tx.d)
- (datediff(day, fd.first_date, tx.d)/7 * 2)
from tx
cross join first_date fd
union
select
xxx.n, h.d,
diff = datediff(day, fd.first_date, h.d)
- (datediff(day, fd.first_date, h.d)/7 * 2)
from holiday h
cross join first_date fd
cross join (select distinct n from tx) as xxx
)
,grouped as
(
select *, grp = diff - row_number() over(partition by n order by d)
from shifted
)
select
d, n, dense_rank() over (partition by n order by grp) as nth_streak
,count(*) over (partition by n, grp) as streak
from grouped
where d not in (select d from holiday) -- remove the holidays
Output:
| D | N | NTH_STREAK | STREAK |
-------------------------------------------
| 2006-12-29 | Bill | 1 | 7 |
| 2007-01-02 | Bill | 1 | 7 |
| 2007-01-03 | Bill | 1 | 7 |
| 2007-01-04 | Bill | 1 | 7 |
| 2007-01-05 | Bill | 1 | 7 |
| 2007-01-08 | Bill | 1 | 7 |
| 2007-01-09 | Bill | 1 | 7 |
| 2012-07-09 | Bill | 2 | 3 |
| 2012-07-10 | Bill | 2 | 3 |
| 2012-07-11 | Bill | 2 | 3 |
Live test: http://www.sqlfiddle.com/#!3/815c5/1
The main logic of the query is to shift all the dates two days back. This is done by dividing the date to 7 and multiplying it by two, then subtracting it from the original number. For example, if a given date falls on 15th, this will be computed as 15/7 * 2 == 4; then subtract 4 from the original number, 15 - 4 == 11. 15 will become the 11th day. Likewise the 8th day becomes the 6th day; 8 - (8/7 * 2) == 6.
Weekends are not in attendance(e.g. 6,7,13,14)
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15
Applying the computation to all the weekday numbers will yield these values:
1 2 3 4 5
6 7 8 9 10
11
For holidays, you need to slot them on attendance, so to the consecutive-ness could be easily determined, then just remove them from the final query. The above attendance yields 11 consecutive good attendance.
Query logic's detailed explanation here: http://www.ienablemuch.com/2012/07/monitoring-perfect-attendance.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With