Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging date intervals in SQL Server

I have the following data:

StartDate   |  EndDate
-------------------------
1982.03.02  |  1982.09.30 
1982.10.01  |  1985.01.17 
1985.06.26  |  1985.07.26 
1985.07.30  |  1991.12.31 
1992.01.01  |  1995.12.31 
1996.01.01  |  2004.05.31 
2004.06.05  |  2006.01.31 
2006.02.01  |  2011.05.20              

I need to merge any intervals that are adjacent (both start and the end date are included in the intervals, so an interval ending on 2003.05.06 is adjacent with an interval starting on 2003.05.07), so in this case, the resulting set should be:

StartDate   |  EndDate
-------------------------
1982.03.02  |  1985.01.17 
1985.06.26  |  1985.07.26 
1985.07.30  |  2004.05.31 
2004.06.05  |  2011.05.20              

For me, the obvious way to do this is to iterate the set with a cursor, and construct a result set row-by-row. However, this functionality will be within code that could potentially be called thousands of times in a day, on a server under heavy load, so I'd prefer not having any performance issues. Any data set is small (20 rows tops), and the data range is large, so any solution that generates all the dates in a range is unfeasible.

Is there a better way I'm not seeing?


Initialization code (from Damien's answer):

CREATE TABLE Periods (
    StartDate datetime NOT NULL CONSTRAINT PK_Periods PRIMARY KEY CLUSTERED,
    EndDate datetime NOT NULL
)

INSERT INTO Periods(StartDate,EndDate)
SELECT '19820302', '19820930'
UNION ALL SELECT '19821001', '19850117'
UNION ALL SELECT '19850626', '19850726'
UNION ALL SELECT '19850730', '19911231'
UNION ALL SELECT '19920101', '19951231'
UNION ALL SELECT '19960101', '20040531'
UNION ALL SELECT '20040605', '20060131'
UNION ALL SELECT '20060201', '20110520'
like image 323
SWeko Avatar asked May 20 '11 07:05

SWeko


3 Answers

It takes longer for me to set up the sample data than to write the query - it would be better if you posted questions that include CREATE TABLE and INSERT/SELECT statements. I don't know what your table is called, I've called mine Periods:

create table Periods (
    StartDate date not null,
    EndDate date not null
)
go
insert into Periods(StartDate,EndDate)
select '19820302','19820930' union all
select '19821001','19850117' union all
select '19850626','19850726' union all
select '19850730','19911231' union all
select '19920101','19951231' union all
select '19960101','20040531' union all
select '20040605','20060131' union all
select '20060201','20110520'
go
; with MergedPeriods as (
    Select p1.StartDate, p1.EndDate
    from
        Periods p1
            left join
        Periods p2
            on
                p1.StartDate = DATEADD(day,1,p2.EndDate)
    where
        p2.StartDate is null
    union all
    select p1.StartDate,p2.EndDate
    from
        MergedPeriods p1
            inner join
        Periods p2
            on
                p1.EndDate = DATEADD(day,-1,p2.StartDate)
)
select StartDate,MAX(EndDate) as EndDate
from MergedPeriods group by StartDate

Result:

StartDate   EndDate
1982-03-02  1985-01-17
1985-06-26  1985-07-26
1985-07-30  2004-05-31
2004-06-05  2011-05-20
like image 181
Damien_The_Unbeliever Avatar answered Oct 09 '22 23:10

Damien_The_Unbeliever


Here's a query that performs best of all submissions so far, with only two table accesses in the execution plan (instead of three or more). All queries are of course helped by indexes. Please note that the execution plan rates this query as more expensive, but the actual Reads & CPU are significantly better. Estimated costs in execution plans are not the same as actual performance.

WITH Grps AS (
   SELECT
      (Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
      P1.StartDate,
      P1.EndDate
   FROM
      Periods P1
      CROSS JOIN (SELECT -1 UNION ALL SELECT 1) D (Dir)
      LEFT JOIN Periods P2 ON
         DateAdd(Day, D.Dir, P1.StartDate) = P2.EndDate
         OR DateAdd(Day, D.Dir, P1.EndDate) = P2.StartDate
   WHERE
      (Dir = -1 AND P2.EndDate IS NULL)
      OR (Dir = 1 AND P2.StartDate IS NULL)
)
SELECT
   Min(StartDate) StartDate,
   Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;

One more thing I think worth mentioning is that querying your date period table would all around in most cases be simpler and better performing if you used exclusive end dates (aka "open" end dates) instead of closed ones:

StartDate   | EndDate     | EndDate
(Inclusive) | (Inclusive) | (Exclusive)
---------------------------------------
1982.03.02  | 1982.09.30  | 1982.10.01
1982.10.01  | 1985.01.17  | 1985.01.18

Using exclusive end dates is (in my opinion) best practice most of the time because it allows you to change the data type of the date column or to change the resolution of the date, without affecting any queries, code, or other logic. For example, if your dates needed to be to the nearest 12 hours instead of 24 hours, you'd have major work to get that accomplished, whereas if you used exclusive end dates not a single thing would have to change!

If you were using exclusive end dates, my query would look like this:

WITH Grps AS (
   SELECT
      (Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
      P1.StartDate,
      P1.EndDate
   FROM
      Periods P1
      CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Which)
      LEFT JOIN Periods P2 ON
         (X.Which = 1 AND P1.StartDate = P2.EndDate)
         OR (X.Which = 2 AND P1.EndDate = P2.StartDate)
   WHERE
      P2.EndDate IS NULL
      OR P2.StartDate IS NULL
)
SELECT
   Min(StartDate) StartDate,
   Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;

Notice there's no DateAdd or DateDiff now, with hardcoded values of "1 Day" that would have to change if you for example switched to 12-hour periods.

Update

Here's an updated query that incorporates things I've learned in the last almost 5 years. This query now has no joins at all, and though it does have 3 sort operations in it which could be performance problems, I think this query will compete reasonably well, and in the absence of indexes will probably beat all others hands down.

WITH Groups AS (
   SELECT Grp = Row_Number() OVER (ORDER BY StartDate) / 2, *
   FROM
      #Periods
      (VALUES (0), (0)) X (Dup)
), Ranges AS (
   SELECT StartDate = Max(StartDate), EndDate = Min(EndDate)
   FROM Groups
   GROUP BY Grp
   HAVING Max(StartDate) <> DateAdd(day, 1, Min(EndDate))
), ReGroups AS (
   SELECT
      Grp = Row_Number() OVER (ORDER BY StartDate) / 2,
      StartDate,
      EndDate
   FROM
      Ranges
      CROSS JOIN (VALUES (0), (0)) X (Dup)
)
SELECT
   StartDate = Min(StartDate),
   EndDate = Max(EndDate)
FROM ReGroups
GROUP BY Grp
HAVING Count(*) = 2
;

And here's yet another version using windowing functions (kind of what the previous query is simulating):

WITH LeadLag AS (
   SELECT
      PrevEndDate = Coalesce(Lag(EndDate) OVER (ORDER BY StartDate), '00010101'),
      NextStartDate = Coalesce(Lead(StartDate) OVER (ORDER BY StartDate), '99991231'),
      *
   FROM #Periods
), Dates AS (
   SELECT
      X.*
   FROM
      LeadLag
      CROSS APPLY (
         SELECT
            StartDate = CASE WHEN DateAdd(day, 1, PrevEndDate) <> StartDate THEN StartDate ELSE NULL END,
            EndDate = CASE WHEN DateAdd(day, 1, EndDate) <> NextStartDate THEN EndDate ELSE NULL END
      ) X
   WHERE
      X.StartDate IS NOT NULL
      OR X.EndDate IS NOT NULL
), Final AS (
   SELECT
      StartDate,
      EndDate = Min(EndDate) OVER (ORDER BY EndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   FROM Dates
)
SELECT *
FROM Final
WHERE StartDate IS NOT NULL
;
like image 44
ErikE Avatar answered Oct 09 '22 21:10

ErikE


You could look up the heads: rows that start a period. Then search for the last end date before the next head in a subquery:

; with heads as
        (
        select  StartDate
        ,       EndDate
        ,       row_number() over (order by StartDate) as rn
        from    @YourTable h
        where   not exists
                (
                select  *
                from    @YourTable next
                where   next.EndDate = dateadd(day, -1, h.StartDate)
                )
        )
select  heads.StartDate
,       (
        select  top 1 EndDate
        from    @YourTable
        where   EndDate < COALESCE(
                (
                select  StartDate
                from    heads h2
                where   heads.rn + 1 = h2.rn
                ), '9999-01-01')
        order by
                EndDate desc
        ) as EndDate
from    heads

Example at ODATA.

like image 22
Andomar Avatar answered Oct 09 '22 21:10

Andomar