Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server : finding gaps in employment - island and gap problem

I have been going through stack overflow to try and work this out over the last week and I still can't work out a viable solution so was wondering if anyone could offer me some help/advice?

Explanation of the data structures

I have the following tables:

Position table (zz_position) which is used to hold the details of the position (Job ID) include the date range that it is valid for.

PosNo   Description                Date_From    Date_To 
---------------------------------------------------------
10001   System Administrator       20170101     20231231

Resource table (zz_resource) which is used to hold the details of a resource (employee) including the date that they joined the company and left it

resID   description  date_from   date_to
------------------------------------------
100     Sam          20160101    20991231
101     Joe          20150101    20991231 

Employment table (zz_employment) which is used to link position to resources within a date from and to range

PosNo    resID       Date_From   Date_To     seqNo
---------------------------------------------------
10001    100         20180101    20180401    1
10001    101         20180601    20191231    2
10001    100         20200101    20991231    3

Problem

Now due to people changing positions, a post might not be filled for a period of time and what I am trying to do is produce a report that I can use to give me a breakdown of the status of a post at any point in time.

I know that I can produce one which fully maps each day using a calendar table however what I want is a report which produces the data in the following aggregated format:

PosNo   resID      Date_From   Date_To    seqNo
-------------------------------------------------
10001   NULL       20170101    20171231   0
10001   100        20180101    20180401   1
10001   NULL       20180402    20180530   0
10001   101        20180601    20191231   2
10001   100        20200101    20231231   3


insert into zz_employment 
values ('10001', '100', '2018-01-01 00:00:00.000', '2018-04-01 00:00:00.000', 1),
       ('10001', '101', '2018-06-01 00:00:00.000', '2019-12-31 00:00:00.000', 2),
       ('10001', '100', '2020-01-01 00:00:00.000', '2099-12-31 00:00:00.000', 3)

(note how the report has taken the two lines in the table and produced a fully speced out life of the employment where the first null line date from is pulled from the position start date and the last line date to is pulled from the position end date.

Ideally I would like this as a view/function however due to the complexity I would be more than happy to have a series of T SQL statements that I can run each night as part of a data warehouse routine.

Rules

  • all dates are truncated to datetime so that an date_to is referencing the date that it ends not the date and time that it ends
  • if the post/employment/resource has no end date then it will be denoted as 20991231
  • if the employment itself is open ended then the date to in the employment table is denoted as 20991231 even through the position itself might end in 20231231. Ideally I would like the result to respect the position end date.

SQL code:

CREATE TABLE zz_position  
(
     posNo varchar(25) NOT NULL,  
     description varchar(25) NOT NULL,  
     date_from datetime NULL,  
     date_to datetime NULL
) 

insert into zz_position 
values ('10001', 'System Administrator', '2017-01-01 00:00:00.000', '2020-12-31 00:00:00.000')
go

CREATE TABLE zz_resource
(
     resID varchar(25) NOT NULL,  
     description varchar(25) NOT NULL,  
     date_from datetime NULL,  
     date_to datetime NULL
)  

insert into zz_resource 
values ('100', 'Sam', '2016-01-01 00:00:00.000', '2099-12-31 00:00:00.000'),
       ('101', 'Joe', '2015-01-01 00:00:00.000', '2099-12-31 00:00:00.000')
go

CREATE TABLE zz_employment
(
      posNo varchar(25) NOT NULL,  
      resID varchar(25) NOT NULL,  
      date_from datetime NULL,  
      date_to datetime NULL,
      seqNo int NULL
)  

insert into zz_employment 
values ('10001', '100', '2018-01-01 00:00:00.000', '2018-04-01 00:00:00.000', 1),
       ('10001', '101', '2018-06-01 00:00:00.000', '2019-12-31 00:00:00.000', 2),
       ('10001', '100', '2020-01-01 00:00:00.000', '2099-12-31 00:00:00.000', 3)
like image 945
Reky Avatar asked Feb 19 '19 16:02

Reky


1 Answers

There are 2 caveats for this problem:

  • A calendar table.
  • A way to correctly group unemployed periods when there's an employed period in between.

The following solution uses a calendar table (SQL included) and an DATEDIFF() with anchor-date trick to group correctly for the 2nd point.

Complete DB Fiddle here.

Solution (explanation below):

;WITH AllPositionDates AS
(
    SELECT
        T.posNo,
        C.GeneratedDate
    FROM
        zz_position AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
    SELECT
        T.posNo,
        T.resID,
        T.seqNo,
        C.GeneratedDate
    FROM
        zz_employment AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
PositionsByEmployed AS
(
    SELECT
        P.posNo,
        P.GeneratedDate,
        E.resID,
        E.seqNo,
        NullRowNumber = ROW_NUMBER() OVER (
            PARTITION BY
                P.posNo,
                CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
            ORDER BY
                P.GeneratedDate ASC)
    FROM
        AllPositionDates AS P
        LEFT JOIN AllEmployedDates AS E ON
            P.posNo = E.posNo AND
            P.GeneratedDate = E.GeneratedDate
)
SELECT
    P.posNo,
    P.resID,
    Date_From = MIN(P.GeneratedDate),
    Date_To = MAX(P.GeneratedDate),
    seqNo = ISNULL(P.seqNo, 0)
FROM
    PositionsByEmployed AS P
GROUP BY
    P.posNo,
    P.resID,
    P.seqNo,
    CASE WHEN P.resId IS NULL THEN P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate) END -- GroupingValueGroupingValue
ORDER BY
    P.posNo,
    Date_From,
    Date_To

The result:

posNo   resID   Date_From   Date_To     seqNo
10001   NULL    2017-01-01  2017-12-31  0
10001   100     2018-01-01  2018-04-01  1
10001   NULL    2018-04-02  2018-05-31  0
10001   101     2018-06-01  2019-12-31  2
10001   100     2020-01-01  2020-12-31  3

Explanation

First the creating of a calendar table. This holds 1 row for each day and in this example it's limited to the first and last possible day of the job positions:

DECLARE @DateStart DATE = (SELECT MIN(P.date_from) FROM zz_position AS P)
DECLARE @DateEnd DATE = (SELECT(MAX(P.date_to)) FROM zz_position AS P)

;WITH GeneratedDates AS
(
    SELECT
        GeneratedDate = @DateStart

    UNION ALL

    SELECT
        GeneratedDate = DATEADD(DAY, 1, G.GeneratedDate)
    FROM
        GeneratedDates AS G
    WHERE
        DATEADD(DAY, 1, G.GeneratedDate) <= @DateEnd
)
SELECT
    DateID = IDENTITY(INT, 1, 1),
    G.GeneratedDate
INTO
    Calendar
FROM
    GeneratedDates AS G
OPTION
    (MAXRECURSION 0)

This generates the following (up to 2020-12-31, which is max date from sample data):

DateID  GeneratedDate
1       2017-01-01
2       2017-01-02
3       2017-01-03
4       2017-01-04
5       2017-01-05
6       2017-01-06
7       2017-01-07

Now we use a join with a between to "spread" the periods of both the positions and the employees periods (on different CTEs), so we get 1 row for each day, for each position/employee.

-- AllPositionDates
SELECT
    T.posNo,
    C.GeneratedDate
FROM
    zz_position AS T
    INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to

-- AllEmployedDates
SELECT
    T.posNo,
    T.resID,
    T.seqNo,
    C.GeneratedDate
FROM
    zz_employment AS T
    INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to

With these, we join them together by position and date using LEFT JOIN, so we get all days of each position and the matching employee (if exists). We also calculate a row number for all NULL values for each position that we are gonna use later. Note that this row number increases 1 by 1 with each following date accordingly.

;WITH AllPositionDates AS
(
    SELECT
        T.posNo,
        C.GeneratedDate
    FROM
        zz_position AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
    SELECT
        T.posNo,
        T.resID,
        T.seqNo,
        C.GeneratedDate
    FROM
        zz_employment AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
)
-- PositionsByEmployee
SELECT
    P.posNo,
    P.GeneratedDate,
    E.resID,
    E.seqNo,
    NullRowNumber = ROW_NUMBER() OVER (
        PARTITION BY
            P.posNo,
            CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
        ORDER BY
            P.GeneratedDate ASC)
    FROM
        AllPositionDates AS P
        LEFT JOIN AllEmployedDates AS E ON
            P.posNo = E.posNo AND
            P.GeneratedDate = E.GeneratedDate

Now with the tricky part. If we calculate the amount of days of difference between a hard-coded date and each day, we get a similar "row number" that increases consistently for each date.

SELECT
    P.posNo,
    P.GeneratedDate,
    DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
    P.NullRowNumber
FROM
    PositionsByEmployed AS P -- This is declare with the WITH (full solution below)
ORDER BY
    P.posNo,
    P.GeneratedDate

We get the following:

posNo   GeneratedDate   DateDiff    NullRowNumber
10001   2017-01-01      6210        1
10001   2017-01-02      6211        2
10001   2017-01-03      6212        3
10001   2017-01-04      6213        4
10001   2017-01-05      6214        5
10001   2017-01-06      6215        6
10001   2017-01-07      6216        7
10001   2017-01-08      6217        8
10001   2017-01-09      6218        9

If we add another column with the rest of these 2 you will see that the value remains the same:

SELECT
    P.posNo,
    P.GeneratedDate,
    DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
    P.NullRowNumber,
    GroupingValue = P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate)
FROM
    PositionsByEmployed AS P
ORDER BY
    P.posNo,
    P.GeneratedDate

We get:

posNo   GeneratedDate   DateDiff    NullRowNumber   GroupingValue
10001   2017-01-01      6210        1               -6209
10001   2017-01-02      6211        2               -6209
10001   2017-01-03      6212        3               -6209
10001   2017-01-04      6213        4               -6209
10001   2017-01-05      6214        5               -6209
10001   2017-01-06      6215        6               -6209
10001   2017-01-07      6216        7               -6209
10001   2017-01-08      6217        8               -6209
10001   2017-01-09      6218        9               -6209
10001   2017-01-10      6219        10              -6209

But if we scroll down until we see values that are NULL for employee (from the ROW_NUMBER() PARTITION BY expression E.PosNo), we see that the rest differs, since the ROW_NUMBER() kept increasing 1 by 1 and the DATEDIFF jumped because there are employed people in between:

posNo   GeneratedDate   DateDiff    NullRowNumber   GroupingValue
10001   2017-12-28      6571        362             -6209
10001   2017-12-29      6572        363             -6209
10001   2017-12-30      6573        364             -6209
10001   2017-12-31      6574        365             -6209
...
10001   2018-04-02      6666        366             -6300
10001   2018-04-03      6667        367             -6300
10001   2018-04-04      6668        368             -6300
10001   2018-04-05      6669        369             -6300
10001   2018-04-06      6670        370             -6300
10001   2018-04-07      6671        371             -6300

Use use this "GroupingValue" as an additional GROUP BY to correctly separate position intervals that fall outside employed intervals.

like image 177
EzLo Avatar answered Nov 15 '22 04:11

EzLo