How to calculate longest streak in SQL?

I have

  TABLE EMPLOYEE - ID,DATE,IsPresent

I want to calculate the longest presence streak for an employee. The IsPresent bit will be false for the days he didn't come in, so I want the largest number of consecutive dates on which he came to the office. The Date column is unique. I tried this:

Select Id,Count(*) from Employee where IsPresent=1

But the above doesn't work. Can anyone guide me on how to calculate the streak? I am sure people have come across this before. I tried searching online but didn't understand what I found. Please help me out.

Vishal asked Jun 15 '10 22:06


2 Answers

The group by is missing.

To select the total man-days of attendance for the whole office (everyone combined):

Select Count(*) from Employee where IsPresent=1

To select the man-days of attendance per employee:

Select Id,Count(*)
from Employee
where IsPresent=1
group by id;

But that is still not good enough, because it counts the total days of attendance and NOT the length of continuous attendance.
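
To make the difference concrete, here is a minimal sketch that runs the per-employee count against a few sample rows. It assumes SQLite driven from Python purely so that the example is runnable; the sample rows and the variable names are illustrative and not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table EMPLOYEE (ID integer, DATE text, IsPresent integer, "
    "primary key (ID, DATE))"
)
# Hypothetical sample: employee 3 is present on 4 days, but never more than 2 in a row.
conn.executemany(
    "insert into EMPLOYEE values (?, ?, ?)",
    [
        (3, "2000-01-01", 0), (3, "2000-01-02", 1), (3, "2000-01-03", 1),
        (3, "2000-01-04", 0), (3, "2000-01-05", 1), (3, "2000-01-06", 1),
        (3, "2000-01-07", 0),
    ],
)

totals = conn.execute(
    "select ID, count(*) from EMPLOYEE where IsPresent = 1 group by ID"
).fetchall()
print(totals)  # [(3, 4)]: total days present, not the longest run
```

The query reports 4 for this employee, while the longest run of consecutive present days in the sample is 2.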

What you need to do is construct a temp table that lists all the days on which an employee was absent, with a second date column, date2. Initially date2 is set to today (a placeholder here for your database's current-date expression).

create table tmpdb.absentdates as
Select id, date, today as date2
from EMPLOYEE
where IsPresent=0
order by id, date;

The trick is to calculate the date difference between two consecutive absent days to find the length of the run of present days between them. So fill in date2 with the next absent date for the same employee. The most recent record per employee keeps the value today, because the database holds no later absence for that employee.

update tmpdb.absentdates
set date2 = coalesce((
  select min(a2.date)
  from tmpdb.absentdates a2
  where a2.id = tmpdb.absentdates.id
    and a2.date > tmpdb.absentdates.date
  ), date2)

The query above reads from the same table it is updating, which some databases reject or handle badly, so it is better to create two copies of the temp table.

create table tmpdb.absentdatesX as
Select id, date
from EMPLOYEE
where IsPresent=0
order by id, date;

create table tmpdb.absentdates as
select *, today as date2
from tmpdb.absentdatesX;

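You also need to insert each employee's hiring date, presuming that the earliest date per employee in the database is the hiring date, so that a streak starting on the first day is covered. Skip employees whose first day is already recorded as an absence, to avoid duplicate rows.

insert into tmpdb.absentdates
select f.id, f.firstDate, today
from (select id, min(date) as firstDate from EMPLOYEE group by id) f
where not exists (
  select 1 from tmpdb.absentdatesX x
  where x.id = f.id and x.date = f.firstDate)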

Now update date2 with the next later absent date, so that date2 - date can be computed. Rows with no later absence keep today.

update tmpdb.absentdates
set date2 = coalesce((
  select min(x.date)
  from tmpdb.absentdatesX x
  where x.id = tmpdb.absentdates.id
    and x.date > tmpdb.absentdates.date
  ), date2)
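
The interval-building steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses SQLite from Python so that it runs anywhere, drops the tmpdb prefix, replaces today with a fixed date so the output is reproducible, and skips the hiring-date row when the first recorded day is already an absence. The sample rows are made up.

```python
import sqlite3

TODAY = "2000-01-08"  # fixed stand-in for "today" (assumption, for reproducible output)

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table EMPLOYEE (id integer, date text, IsPresent integer, "
    "primary key (id, date))"
)
conn.executemany(
    "insert into EMPLOYEE values (?, ?, ?)",
    [
        (1, "2000-01-01", 1),  # employee 1: present on the first recorded day
        (5, "2000-01-01", 0), (5, "2000-01-02", 1), (5, "2000-01-03", 0),
        (5, "2000-01-04", 1), (5, "2000-01-05", 1), (5, "2000-01-06", 1),
        (5, "2000-01-07", 0),
    ],
)

# Read-only copy of the absent days, plus the working table with date2 = today.
conn.execute(
    "create table absentdatesX as select id, date from EMPLOYEE where IsPresent = 0"
)
conn.execute(
    f"create table absentdates as select id, date, '{TODAY}' as date2 from absentdatesX"
)

# Add the first recorded day per employee unless it is already listed as an absence.
conn.execute(
    f"""
    insert into absentdates
    select f.id, f.firstDate, '{TODAY}'
    from (select id, min(date) as firstDate from EMPLOYEE group by id) f
    where not exists (
        select 1 from absentdatesX x
        where x.id = f.id and x.date = f.firstDate
    )
    """
)

# Point date2 at the next absence of the same employee; keep "today" if there is none.
conn.execute(
    """
    update absentdates
    set date2 = coalesce(
        (select min(x.date) from absentdatesX x
         where x.id = absentdates.id and x.date > absentdates.date),
        date2
    )
    """
)

intervals = conn.execute(
    "select id, date, date2 from absentdates order by id, date"
).fetchall()
for row in intervals:
    print(row)
```

Each resulting row is an interval that starts at an absence (or at the first recorded day) and ends at the next absence, or at the stand-in for today when no later absence exists.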

This lists the lengths of the periods during which an employee was continuously present:

select id, datediff(date2, date) as continuousPresence
from tmpdb.absentdates
group by id, continuousPresence
order by id, continuousPresence

But you only want the longest streak:

select id, max(datediff(date2, date)) as continuousPresence
from tmpdb.absentdates
group by id
order by id

However, the above is still problematic because datediff counts calendar days: it does not take holidays and weekends into account, and it includes the absent day that opens each interval.

So instead we rely on the count of present-day records, which reflects the legitimate working days.

create table tmpdb.absentCount as
Select a.id, a.date, a.date2, count(*) as continuousPresence
from EMPLOYEE e, tmpdb.absentdates a
where e.id = a.id
  and e.IsPresent = 1
  and e.date >= a.date
  and e.date < a.date2
group by a.id, a.date, a.date2
order by a.id, a.date;

Remember: whenever you use an aggregate such as count or avg, every non-aggregated column in the select list must also appear in the group by clause.

Now select the max streak:

select id, max(continuousPresence)
from tmpdb.absentCount
group by id
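
As a small illustration of why counting records differs from a date difference, the sketch below uses a week with no rows for Saturday and Sunday. It assumes SQLite from Python, made-up sample rows, an absentdates table filled in by hand as the earlier steps would leave it, and it counts only rows with IsPresent = 1 so that the absent day opening the interval is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table EMPLOYEE (id integer, date text, IsPresent integer)")
conn.executemany(
    "insert into EMPLOYEE values (?, ?, ?)",
    [
        (7, "2000-01-05", 0),  # Wednesday, absent
        (7, "2000-01-06", 1),  # Thursday
        (7, "2000-01-07", 1),  # Friday (no rows exist for the weekend)
        (7, "2000-01-10", 1),  # Monday
        (7, "2000-01-11", 0),  # Tuesday, absent
    ],
)

# Intervals as the earlier steps would leave them: absence -> next absence.
conn.execute("create table absentdates (id integer, date text, date2 text)")
conn.executemany(
    "insert into absentdates values (?, ?, ?)",
    [
        (7, "2000-01-05", "2000-01-11"),
        (7, "2000-01-11", "2000-01-12"),  # last absence; date2 stands in for "today"
    ],
)

# Count present-day records inside each interval.
conn.execute(
    """
    create table absentCount as
    select a.id, a.date, a.date2, count(*) as continuousPresence
    from EMPLOYEE e join absentdates a on e.id = a.id
    where e.IsPresent = 1 and e.date >= a.date and e.date < a.date2
    group by a.id, a.date, a.date2
    """
)

streaks = conn.execute(
    "select id, max(continuousPresence) from absentCount group by id"
).fetchall()
calendar_days = conn.execute(
    "select cast(julianday(date2) - julianday(date) as integer) "
    "from absentdates where date = '2000-01-05'"
).fetchone()[0]

print(streaks)        # [(7, 3)]
print(calendar_days)  # 6
```

The record count gives a streak of 3 working days, whereas the calendar difference over the same interval is 6.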

To list the dates of the longest streak:

select c.id, c.date, c.date2, c.continuousPresence
from tmpdb.absentCount c
where c.continuousPresence = (
  select max(m.continuousPresence)
  from tmpdb.absentCount m
  where m.id = c.id);

The syntax above may need adjusting for your database (in SQL Server T-SQL, for example, datediff takes the date part as its first argument), but this is the general idea.

Blessed Geek answered Sep 22 '22 23:09


EDIT: Here's a SQL Server version of the query:

with LowerBound as (select second_day.EmployeeId
        , second_day."DATE" as LowerDate
        , row_number() over (partition by second_day.EmployeeId 
            order by second_day."DATE") as RN
    from T second_day
    left outer join T first_day
        on first_day.EmployeeId = second_day.EmployeeId
        and first_day."DATE" = dateadd(day, -1, second_day."DATE")
        and first_day.IsPresent = 1
    where first_day.EmployeeId is null
    and second_day.IsPresent = 1)
, UpperBound as (select first_day.EmployeeId
        , first_day."DATE" as UpperDate
        , row_number() over (partition by first_day.EmployeeId 
            order by first_day."DATE") as RN
    from T first_day
    left outer join T second_day
        on first_day.EmployeeId = second_day.EmployeeId
        and first_day."DATE" = dateadd(day, -1, second_day."DATE")
        and second_day.IsPresent = 1
    where second_day.EmployeeId is null
    and first_day.IsPresent = 1)
select LB.EmployeeID, max(datediff(day, LowerDate, UpperDate) + 1) as LongestStreak
from LowerBound LB
inner join UpperBound UB
    on LB.EmployeeId = UB.EmployeeId
    and LB.RN = UB.RN
group by LB.EmployeeId

SQL Server version of the test data:

create table T (EmployeeId int
    , "DATE" date not null
    , IsPresent bit not null 
    , constraint T_PK primary key (EmployeeId, "DATE")
)


insert into T values (1, '2000-01-01', 1);
insert into T values (2, '2000-01-01', 0);
insert into T values (3, '2000-01-01', 0);
insert into T values (3, '2000-01-02', 1);
insert into T values (3, '2000-01-03', 1);
insert into T values (3, '2000-01-04', 0);
insert into T values (3, '2000-01-05', 1);
insert into T values (3, '2000-01-06', 1);
insert into T values (3, '2000-01-07', 0);
insert into T values (4, '2000-01-01', 0);
insert into T values (4, '2000-01-02', 1);
insert into T values (4, '2000-01-03', 1);
insert into T values (4, '2000-01-04', 1);
insert into T values (4, '2000-01-05', 1);
insert into T values (4, '2000-01-06', 1);
insert into T values (4, '2000-01-07', 0);
insert into T values (5, '2000-01-01', 0);
insert into T values (5, '2000-01-02', 1);
insert into T values (5, '2000-01-03', 0);
insert into T values (5, '2000-01-04', 1);
insert into T values (5, '2000-01-05', 1);
insert into T values (5, '2000-01-06', 1);
insert into T values (5, '2000-01-07', 0);
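
To check the result without a SQL Server or Oracle instance, the same lower-bound/upper-bound query can be sketched against SQLite from Python. This is a port under assumptions, not the author's code: dates are stored as ISO text, the date arithmetic uses SQLite's date() and julianday() functions, and row_number() needs SQLite 3.25 or later. The test data is the same as above, encoded as one character per day.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table T (EmployeeId integer, "DATE" text not null, '
    'IsPresent integer not null, primary key (EmployeeId, "DATE"))'
)
# One character per day starting at 2000-01-01; 1 = present, 0 = absent.
patterns = {1: "1", 2: "0", 3: "0110110", 4: "0111110", 5: "0101110"}
start = date(2000, 1, 1)
for emp, pattern in patterns.items():
    for offset, flag in enumerate(pattern):
        day = (start + timedelta(days=offset)).isoformat()
        conn.execute("insert into T values (?, ?, ?)", (emp, day, int(flag)))

query = """
with LowerBound as (
    -- present days whose previous day is not a present day: streak starts
    select second_day.EmployeeId, second_day."DATE" as LowerDate,
           row_number() over (partition by second_day.EmployeeId
                              order by second_day."DATE") as RN
    from T second_day
    left outer join T first_day
        on first_day.EmployeeId = second_day.EmployeeId
        and first_day."DATE" = date(second_day."DATE", '-1 day')
        and first_day.IsPresent = 1
    where first_day.EmployeeId is null and second_day.IsPresent = 1
), UpperBound as (
    -- present days whose next day is not a present day: streak ends
    select first_day.EmployeeId, first_day."DATE" as UpperDate,
           row_number() over (partition by first_day.EmployeeId
                              order by first_day."DATE") as RN
    from T first_day
    left outer join T second_day
        on first_day.EmployeeId = second_day.EmployeeId
        and first_day."DATE" = date(second_day."DATE", '-1 day')
        and second_day.IsPresent = 1
    where second_day.EmployeeId is null and first_day.IsPresent = 1
)
select LB.EmployeeId,
       max(cast(julianday(UpperDate) - julianday(LowerDate) as integer) + 1)
           as LongestStreak
from LowerBound LB
inner join UpperBound UB
    on LB.EmployeeId = UB.EmployeeId and LB.RN = UB.RN
group by LB.EmployeeId
order by LB.EmployeeId
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 1), (3, 2), (4, 5), (5, 3)]
```

Employee 2 has no present days, so no streak starts or ends for that employee and no row is returned.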

The original version below is written for Oracle; for SQL Server, substitute the appropriate date arithmetic.

Assumptions:

  • Date is either a Date value or DateTime with time component of 00:00:00.
  • The primary key is (EmployeeId, Date)
  • All fields are not null
  • If a date is missing for the employee, they were not present. (This is used to handle the beginning and end of the data series, but it also means that missing dates in the middle will break streaks, which could be a problem depending on requirements.)

    with LowerBound as (select second_day.EmployeeId
            , second_day."DATE" as LowerDate
            , row_number() over (partition by second_day.EmployeeId 
                order by second_day."DATE") as RN
        from T second_day
        left outer join T first_day
            on first_day.EmployeeId = second_day.EmployeeId
            and first_day."DATE" = second_day."DATE" - 1
            and first_day.IsPresent = 1
        where first_day.EmployeeId is null
        and second_day.IsPresent = 1)
    , UpperBound as (select first_day.EmployeeId
            , first_day."DATE" as UpperDate
            , row_number() over (partition by first_day.EmployeeId 
                order by first_day."DATE") as RN
        from T first_day
        left outer join T second_day
            on first_day.EmployeeId = second_day.EmployeeId
            and first_day."DATE" = second_day."DATE" - 1
            and second_day.IsPresent = 1
        where second_day.EmployeeId is null
        and first_day.IsPresent = 1)
    select LB.EmployeeID, max(UpperDate - LowerDate + 1) as LongestStreak
    from LowerBound LB
    inner join UpperBound UB
        on LB.EmployeeId = UB.EmployeeId
        and LB.RN = UB.RN
    group by LB.EmployeeId
    

Test Data:

    create table T (EmployeeId number(38) 
        , "DATE" date not null check ("DATE" = trunc("DATE"))
        , IsPresent number not null check (IsPresent in (0, 1))
        , constraint T_PK primary key (EmployeeId, "DATE")
    )
    /

    insert into T values (1, to_date('2000-01-01', 'YYYY-MM-DD'), 1);
    insert into T values (2, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
    insert into T values (3, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
    insert into T values (3, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
    insert into T values (3, to_date('2000-01-03', 'YYYY-MM-DD'), 1);
    insert into T values (3, to_date('2000-01-04', 'YYYY-MM-DD'), 0);
    insert into T values (3, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
    insert into T values (3, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
    insert into T values (3, to_date('2000-01-07', 'YYYY-MM-DD'), 0);
    insert into T values (4, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
    insert into T values (4, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
    insert into T values (4, to_date('2000-01-03', 'YYYY-MM-DD'), 1);
    insert into T values (4, to_date('2000-01-04', 'YYYY-MM-DD'), 1);
    insert into T values (4, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
    insert into T values (4, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
    insert into T values (4, to_date('2000-01-07', 'YYYY-MM-DD'), 0);
    insert into T values (5, to_date('2000-01-01', 'YYYY-MM-DD'), 0);
    insert into T values (5, to_date('2000-01-02', 'YYYY-MM-DD'), 1);
    insert into T values (5, to_date('2000-01-03', 'YYYY-MM-DD'), 0);
    insert into T values (5, to_date('2000-01-04', 'YYYY-MM-DD'), 1);
    insert into T values (5, to_date('2000-01-05', 'YYYY-MM-DD'), 1);
    insert into T values (5, to_date('2000-01-06', 'YYYY-MM-DD'), 1);
    insert into T values (5, to_date('2000-01-07', 'YYYY-MM-DD'), 0);

Shannon Severance answered Sep 22 '22 23:09