I have been going through stack overflow to try and work this out over the last week and I still can't work out a viable solution so was wondering if anyone could offer me some help/advice?
Explanation of the data structures
I have the following tables:
Position table (zz_position
) which is used to hold the details of the
position (Job ID) include the date range that it is valid for.
PosNo Description Date_From Date_To
---------------------------------------------------------
10001 System Administrator 20170101 20231231
Resource table (zz_resource
) which is used to hold the details of a resource (employee) including the date that they joined the company and left it
resID description date_from date_to
------------------------------------------
100 Sam 20160101 20991231
101 Joe 20150101 20991231
Employment table (zz_employment
) which is used to link position to resources within a date from and to range
PosNo resID Date_From Date_To seqNo
---------------------------------------------------
10001 100 20180101 20180401 1
10001 101 20180601 20191231 2
10001 100 20200101 20991231 3
Problem
Now due to people changing positions, a post might not be filled for a period of time and what I am trying to do is produce a report that I can use to give me a breakdown of the status of a post at any point in time.
I know that I can produce one which fully maps each day using a calendar table however what I want is a report which produces the data in the following aggregated format:
PosNo resID Date_From Date_To seqNo
-------------------------------------------------
10001 NULL 20170101 20171231 0
10001 100 20180101 20180401 1
10001 NULL 20180402 20180530 0
10001 101 20180601 20191231 2
10001 100 20200101 20231231 3
insert into zz_employment
values ('10001', '100', '2018-01-01 00:00:00.000', '2018-04-01 00:00:00.000', 1),
('10001', '101', '2018-06-01 00:00:00.000', '2019-12-31 00:00:00.000', 2),
('10001', '100', '2020-01-01 00:00:00.000', '2099-12-31 00:00:00.000', 3)
(note how the report has taken the two lines in the table and produced a fully speced out life of the employment where the first null line date from is pulled from the position start date and the last line date to is pulled from the position end date.
Ideally I would like this as a view/function however due to the complexity I would be more than happy to have a series of T SQL statements that I can run each night as part of a data warehouse routine.
Rules
SQL code:
CREATE TABLE zz_position
(
posNo varchar(25) NOT NULL,
description varchar(25) NOT NULL,
date_from datetime NULL,
date_to datetime NULL
)
insert into zz_position
values ('10001', 'System Administrator', '2017-01-01 00:00:00.000', '2020-12-31 00:00:00.000')
go
CREATE TABLE zz_resource
(
resID varchar(25) NOT NULL,
description varchar(25) NOT NULL,
date_from datetime NULL,
date_to datetime NULL
)
insert into zz_resource
values ('100', 'Sam', '2016-01-01 00:00:00.000', '2099-12-31 00:00:00.000'),
('101', 'Joe', '2015-01-01 00:00:00.000', '2099-12-31 00:00:00.000')
go
CREATE TABLE zz_employment
(
posNo varchar(25) NOT NULL,
resID varchar(25) NOT NULL,
date_from datetime NULL,
date_to datetime NULL,
seqNo int NULL
)
insert into zz_employment
values ('10001', '100', '2018-01-01 00:00:00.000', '2018-04-01 00:00:00.000', 1),
('10001', '101', '2018-06-01 00:00:00.000', '2019-12-31 00:00:00.000', 2),
('10001', '100', '2020-01-01 00:00:00.000', '2099-12-31 00:00:00.000', 3)
There are 2 caveats for this problem:
The following solution uses a calendar table (SQL included) and an DATEDIFF()
with anchor-date trick to group correctly for the 2nd point.
Complete DB Fiddle here.
;WITH AllPositionDates AS
(
SELECT
T.posNo,
C.GeneratedDate
FROM
zz_position AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
SELECT
T.posNo,
T.resID,
T.seqNo,
C.GeneratedDate
FROM
zz_employment AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
PositionsByEmployed AS
(
SELECT
P.posNo,
P.GeneratedDate,
E.resID,
E.seqNo,
NullRowNumber = ROW_NUMBER() OVER (
PARTITION BY
P.posNo,
CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
ORDER BY
P.GeneratedDate ASC)
FROM
AllPositionDates AS P
LEFT JOIN AllEmployedDates AS E ON
P.posNo = E.posNo AND
P.GeneratedDate = E.GeneratedDate
)
SELECT
P.posNo,
P.resID,
Date_From = MIN(P.GeneratedDate),
Date_To = MAX(P.GeneratedDate),
seqNo = ISNULL(P.seqNo, 0)
FROM
PositionsByEmployed AS P
GROUP BY
P.posNo,
P.resID,
P.seqNo,
CASE WHEN P.resId IS NULL THEN P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate) END -- GroupingValueGroupingValue
ORDER BY
P.posNo,
Date_From,
Date_To
The result:
posNo resID Date_From Date_To seqNo
10001 NULL 2017-01-01 2017-12-31 0
10001 100 2018-01-01 2018-04-01 1
10001 NULL 2018-04-02 2018-05-31 0
10001 101 2018-06-01 2019-12-31 2
10001 100 2020-01-01 2020-12-31 3
First the creating of a calendar table. This holds 1 row for each day and in this example it's limited to the first and last possible day of the job positions:
DECLARE @DateStart DATE = (SELECT MIN(P.date_from) FROM zz_position AS P)
DECLARE @DateEnd DATE = (SELECT(MAX(P.date_to)) FROM zz_position AS P)
;WITH GeneratedDates AS
(
SELECT
GeneratedDate = @DateStart
UNION ALL
SELECT
GeneratedDate = DATEADD(DAY, 1, G.GeneratedDate)
FROM
GeneratedDates AS G
WHERE
DATEADD(DAY, 1, G.GeneratedDate) <= @DateEnd
)
SELECT
DateID = IDENTITY(INT, 1, 1),
G.GeneratedDate
INTO
Calendar
FROM
GeneratedDates AS G
OPTION
(MAXRECURSION 0)
This generates the following (up to 2020-12-31, which is max date from sample data):
DateID GeneratedDate
1 2017-01-01
2 2017-01-02
3 2017-01-03
4 2017-01-04
5 2017-01-05
6 2017-01-06
7 2017-01-07
Now we use a join with a between to "spread" the periods of both the positions and the employees periods (on different CTEs), so we get 1 row for each day, for each position/employee.
-- AllPositionDates
SELECT
T.posNo,
C.GeneratedDate
FROM
zz_position AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
-- AllEmployedDates
SELECT
T.posNo,
T.resID,
T.seqNo,
C.GeneratedDate
FROM
zz_employment AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
With these, we join them together by position and date using LEFT JOIN
, so we get all days of each position and the matching employee (if exists). We also calculate a row number for all NULL
values for each position that we are gonna use later. Note that this row number increases 1 by 1 with each following date accordingly.
;WITH AllPositionDates AS
(
SELECT
T.posNo,
C.GeneratedDate
FROM
zz_position AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
SELECT
T.posNo,
T.resID,
T.seqNo,
C.GeneratedDate
FROM
zz_employment AS T
INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
)
-- PositionsByEmployee
SELECT
P.posNo,
P.GeneratedDate,
E.resID,
E.seqNo,
NullRowNumber = ROW_NUMBER() OVER (
PARTITION BY
P.posNo,
CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
ORDER BY
P.GeneratedDate ASC)
FROM
AllPositionDates AS P
LEFT JOIN AllEmployedDates AS E ON
P.posNo = E.posNo AND
P.GeneratedDate = E.GeneratedDate
Now with the tricky part. If we calculate the amount of days of difference between a hard-coded date and each day, we get a similar "row number" that increases consistently for each date.
SELECT
P.posNo,
P.GeneratedDate,
DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
P.NullRowNumber
FROM
PositionsByEmployed AS P -- This is declare with the WITH (full solution below)
ORDER BY
P.posNo,
P.GeneratedDate
We get the following:
posNo GeneratedDate DateDiff NullRowNumber
10001 2017-01-01 6210 1
10001 2017-01-02 6211 2
10001 2017-01-03 6212 3
10001 2017-01-04 6213 4
10001 2017-01-05 6214 5
10001 2017-01-06 6215 6
10001 2017-01-07 6216 7
10001 2017-01-08 6217 8
10001 2017-01-09 6218 9
If we add another column with the rest of these 2 you will see that the value remains the same:
SELECT
P.posNo,
P.GeneratedDate,
DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
P.NullRowNumber,
GroupingValue = P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate)
FROM
PositionsByEmployed AS P
ORDER BY
P.posNo,
P.GeneratedDate
We get:
posNo GeneratedDate DateDiff NullRowNumber GroupingValue
10001 2017-01-01 6210 1 -6209
10001 2017-01-02 6211 2 -6209
10001 2017-01-03 6212 3 -6209
10001 2017-01-04 6213 4 -6209
10001 2017-01-05 6214 5 -6209
10001 2017-01-06 6215 6 -6209
10001 2017-01-07 6216 7 -6209
10001 2017-01-08 6217 8 -6209
10001 2017-01-09 6218 9 -6209
10001 2017-01-10 6219 10 -6209
But if we scroll down until we see values that are NULL for employee (from the ROW_NUMBER() PARTITION BY
expression E.PosNo
), we see that the rest differs, since the ROW_NUMBER()
kept increasing 1 by 1 and the DATEDIFF
jumped because there are employed people in between:
posNo GeneratedDate DateDiff NullRowNumber GroupingValue
10001 2017-12-28 6571 362 -6209
10001 2017-12-29 6572 363 -6209
10001 2017-12-30 6573 364 -6209
10001 2017-12-31 6574 365 -6209
...
10001 2018-04-02 6666 366 -6300
10001 2018-04-03 6667 367 -6300
10001 2018-04-04 6668 368 -6300
10001 2018-04-05 6669 369 -6300
10001 2018-04-06 6670 370 -6300
10001 2018-04-07 6671 371 -6300
Use use this "GroupingValue" as an additional GROUP BY
to correctly separate position intervals that fall outside employed intervals.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With