Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge rows if date columns are overlapping in TSQL

I have a table in the following format

Id  StartDate   EndDate Type
1   2012-02-18  2012-03-18  1
1   2012-03-17  2012-06-29  1
1   2012-06-27  2012-09-27  1
1   2014-08-23  2014-09-24  3
1   2014-09-23  2014-10-24  3
1   2014-10-23  2014-11-24  3
2   2015-07-04  2015-08-06  1
2   2015-08-04  2015-09-06  1
3   2013-11-01  2013-12-01  0
3   2018-01-09  2018-02-09  0

I found similar questions here, but not something that could help me solve my problem. I want to merge rows that has the same Id, Type and overlapping date periods.

The result from the above table should be

Id  StartDate   EndDate Type
1   2012-02-18  2012-09-27  1
1   2014-08-23  2014-11-24  3
2   2015-07-04  2015-09-06  1
3   2013-11-01  2013-12-01  0
3   2018-01-09  2018-02-09  0

In another server, I was able to do it with the following restrictions and the query below:

  • Didn't care about the Type column, but just the Id
  • Had a newer version of SQL Server (2012), but now I have 2008 which the code is not compatible

SELECT Id
     , MIN(StartDate) AS StartDate
     , MAX(EndDate) AS EndDate
FROM (
    SELECT *
         , SUM(CASE WHEN a.EndDate = a.StartDate THEN 0
                    ELSE 1
               END
           ) OVER (ORDER BY Id, StartDate) sm
    FROM (
        SELECT Id
             , StartDate
             , EndDate
             , LAG(EndDate, 1, NULL) OVER (PARTITION BY Id ORDER BY Id, EndDate) EndDate
        FROM #temptable
    ) a
) b
GROUP BY Id, sm

Any advice how I can

  • Include Type on the process
  • Make it work on SQL Server 2008
like image 811
Tasos Avatar asked May 07 '19 08:05

Tasos


3 Answers

This is 2008 compatible. A CTE really is the best way to link up all overlapping records in my opinion. The date overlap logic came from this thread: SO Date Overlap

I added extra data that's more complex to make sure that it's working as expected.

DECLARE @Data table (Id INT, StartDate DATE, EndDate DATE, Type INT)
INSERT INTO @data 

SELECT 1,'2/18/2012' ,'3/18/2012', 1 UNION ALL
select 1,'3/17/2012','6/29/2012',1 UNION ALL
select 1,'6/27/2012','9/27/2012',1 UNION ALL
select 1,'8/23/2014','9/24/2014',3 UNION ALL
select 1,'9/23/2014','10/24/2014',3 UNION ALL
select 1,'10/23/2014','11/24/2014',3 UNION ALL
select 2,'7/4/2015','8/6/2015',1 UNION ALL
select 2,'8/4/2015','9/6/2015',1 UNION ALL
select 3,'11/1/2013','12/1/2013',0 UNION ALL
select 3,'1/9/2018','2/9/2018',0 UNION ALL 
select 4,'1/1/2018','1/2/2018',0 UNION ALL --many non overlapping dates
select 4,'1/4/2018','1/5/2018',0 UNION ALL
select 4,'1/7/2018','1/9/2018',0 UNION ALL
select 4,'1/11/2018','1/13/2018',0 UNION ALL

select 4,'2/7/2018','2/8/2018',0 UNION ALL --many overlapping dates
select 4,'2/8/2018','2/9/2018',0 UNION ALL 
select 4,'2/9/2018','2/10/2018',0 UNION all
select 4,'2/10/2018','2/11/2018',0 UNION all
select 4,'2/11/2018','2/12/2018',0 UNION all
select 4,'2/12/2018','2/13/2018',0 UNION all

select 4,'3/7/2018','3/8/2018',0 UNION ALL --many overlapping dates, second instance of id 4, type 0
select 4,'3/8/2018','3/9/2018',0 UNION ALL 
select 4,'3/9/2018','3/10/2018',0 UNION all
select 4,'3/10/2018','3/11/2018',0 UNION all
select 4,'3/11/2018','3/12/2018',0 UNION all
select 4,'3/12/2018','3/13/2018',0 



;
WITH cdata
AS (SELECT  Id,
        d.Type,
        d.StartDate,
        d.EndDate,
        CurrentStart = d.StartDate
    FROM    @Data d
    WHERE
        NOT EXISTS (
            SELECT * FROM @Data x WHERE x.StartDate < d.StartDate AND d.StartDate <= x.EndDate AND d.EndDate >= x.StartDate AND d.Id = x.Id AND d.Type = x.Type --get first records for overlapping ranges

                )       
    UNION ALL
    SELECT  d.Id,
        d.Type,
        StartDate = CASE WHEN d2.StartDate < d.StartDate THEN d2.StartDate ELSE d.StartDate END,
        EndDate = CASE WHEN d2.EndDate > d.EndDate THEN d2.EndDate ELSE d.EndDate END,
        CurrentStart = d2.StartDate
    FROM    cdata d
        INNER JOIN @Data d2
            ON (
                d.StartDate <= d2.EndDate
                AND d.EndDate >= d2.StartDate
            ) 
            AND d2.Id = d.Id
            AND d2.Type = d.Type
            AND d2.StartDate > d.CurrentStart)
SELECT cdata.Id, cdata.Type, cdata.StartDate, EndDate = MAX(cdata.EndDate) 
FROM        cdata 
GROUP BY cdata.Id, cdata.Type, cdata.StartDate
like image 44
Jordan Ryder Avatar answered Oct 20 '22 00:10

Jordan Ryder


This approach uses an additional temp table to identify the groups of overlapping dates, and then performs a quick aggregate based on the groupings.

SELECT *, ROW_NUMBER() OVER (ORDER BY Id, Type) AS UID,
    ROW_NUMBER() OVER (ORDER BY Id, Type) AS GroupId INTO #G FROM #TempTable
WHILE @@ROWCOUNT <> 0 BEGIN
    UPDATE T1 SET
        GroupId = T2.GroupId
    FROM #G T1
        INNER JOIN (
            SELECT T1.UID, CASE WHEN T1.GroupId  < T2.GroupId THEN T1.GroupId ELSE T2.GroupId END
            FROM #G T1
                LEFT OUTER JOIN #G T2
                    ON T1.Id = T2.Id AND T1.Type = T2.Type AND T1.GroupId <> T2.GroupId
                        AND T1.StartDate <= T2.EndDate AND T2.StartDate <= T1.EndDate
            ) T2 (UID, GroupId)
            ON T1.UID = T2.UID
    WHERE T1.GroupId <> T2.GroupId
END
SELECT Id, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate, Type
FROM #G G GROUP BY GroupId, Id, Type

This returns the expected values

Id          StartDate  EndDate    Type
----------- ---------- ---------- -----------
1           2012-02-18 2012-09-27 1
1           2014-08-23 2014-11-24 3
2           2015-07-04 2015-09-06 1
3           2013-11-01 2013-12-01 0
3           2018-01-09 2018-02-09 0
like image 172
Jason W Avatar answered Oct 20 '22 00:10

Jason W


This looks like a Packing Intervals problem. See the post by Itzik Ben-Gan for all the details and what indexes he recommends to make it work efficiently. He presents a solution without recursive CTE.

Two notes.

  1. The query below assumes that intervals are [closed; open), i.e. StartDate is inclusive and EndDate is exclusive. This way to represent such data is often the most convenient. (in the same sense as having arrays as zero-based instead of 1-based is usually more convenient in programming languages).

  2. I added a RowID column to have unambiguous sorting.

Sample data

DECLARE @T TABLE
(
    RowID int IDENTITY,
    id int,
    StartDate date,
    EndDate date,
    tp int
);

INSERT INTO @T(Id, StartDate, EndDate, tp) VALUES
(1,  '2012-02-18',  '2012-03-18',  1),
(1,  '2012-03-17',  '2012-06-29',  1),
(1,  '2012-06-27',  '2012-09-27',  1),
(1,  '2014-08-23',  '2014-09-24',  3),
(1,  '2014-09-23',  '2014-10-24',  3),
(1,  '2014-10-23',  '2014-11-24',  3),
(2,  '2015-07-04',  '2015-08-06',  1),
(2,  '2015-08-04',  '2015-09-06',  1),
(3,  '2013-11-01',  '2013-12-01',  0),
(3,  '2018-01-09',  '2018-02-09',  0);

-- Make EndDate an opened interval, make it exclusive
-- [Start; End)
UPDATE @T
SET EndDate = DATEADD(day, 1, EndDate)
;

Recommended indexes

-- indexes to support solutions
CREATE UNIQUE INDEX idx_start_id ON T(id, tp, StartDate, RowID);
CREATE UNIQUE INDEX idx_end_id ON T(id, tp, EndDate, RowID);

Query

Read the Itzik's post to understand what is going on. He has nice illustrations there. In short, each timestamp (start or end) is treated as an event. Each event has a + or - type. Each time we encounter a + event (some interval starts) we increase the running counter. Each time we encounter a - event (some interval ends) we decrease the running counter. When the running counter is 0 it means that the streak of overlapping intervals is over.

I took Itzik's query as is and simply changed the column names to match your names.

WITH C1 AS
-- let e = end ordinals, let s = start ordinals
(
    SELECT
        RowID, id, tp, StartDate AS ts, +1 AS EventType,
        NULL AS e,
        ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY StartDate, RowID) AS s
    FROM @T

    UNION ALL

    SELECT
        RowID, id, tp, EndDate AS ts, -1 AS EventType,
        ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY EndDate, RowID) AS e,
        NULL AS s
    FROM @T
),
C2 AS
-- let se = start or end ordinal, namely, how many events (start or end) happened so far
(
    SELECT C1.*,
    ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts, EventType DESC, RowID) AS se
    FROM C1
),
C3 AS
-- For start events, the expression s - (se - s) - 1 represents how many sessions were active
-- just before the current (hence - 1)
--
-- For end events, the expression (se - e) - e represents how many sessions are active
-- right after this one
--
-- The above two expressions are 0 exactly when a group of packed intervals
-- either starts or ends, respectively
--
-- After filtering only events when a group of packed intervals either starts or ends,
-- group each pair of adjacent start/end events
(
    SELECT id, tp, ts,
        ((ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts) - 1) / 2 + 1)
        AS grpnum
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT id, tp, MIN(ts) AS StartDate, DATEADD(day, -1, MAX(ts)) AS EndDate
FROM C3
GROUP BY id, tp, grpnum
ORDER BY id, tp, StartDate;

Result

+----+----+------------+------------+
| id | tp | StartDate  |  EndDate   |
+----+----+------------+------------+
|  1 |  1 | 2012-02-18 | 2012-09-27 |
|  1 |  3 | 2014-08-23 | 2014-11-24 |
|  2 |  1 | 2015-07-04 | 2015-09-06 |
|  3 |  0 | 2013-11-01 | 2013-12-01 |
|  3 |  0 | 2018-01-09 | 2018-02-09 |
+----+----+------------+------------+
like image 43
Vladimir Baranov Avatar answered Oct 20 '22 01:10

Vladimir Baranov