
Determine contiguous date intervals

I have the following table structure:

id int -- more like a group id, not unique in the table
AddedOn datetime -- when the record was added

For a specific id there is at most one record each day. I have to write a query that returns contiguous (at day level) date intervals for each id. The expected result structure is:

id int
StartDate datetime
EndDate datetime

Note that the time part of AddedOn is available but it is not important here.

To make it clearer, here is some input data:

with data as 
(
  select * from
  (
    values
    (0, getdate()), --dummy record used to infer column types

    (1, '20150101'),
    (1, '20150102'),
    (1, '20150104'),
    (1, '20150105'),
    (1, '20150106'),

    (2, '20150101'),
    (2, '20150102'),
    (2, '20150103'),
    (2, '20150104'),
    (2, '20150106'),
    (2, '20150107'),

    (3, '20150101'),
    (3, '20150103'),
    (3, '20150105'),
    (3, '20150106'),
    (3, '20150108'),
    (3, '20150109'),
    (3, '20150110')
  ) as d(id, AddedOn)
  where id > 0 -- exclude dummy record
)
select * from data

And the expected result:

id      StartDate      EndDate
1       2015-01-01     2015-01-02
1       2015-01-04     2015-01-06

2       2015-01-01     2015-01-04
2       2015-01-06     2015-01-07

3       2015-01-01     2015-01-01
3       2015-01-03     2015-01-03
3       2015-01-05     2015-01-06
3       2015-01-08     2015-01-10

Although this looks like a common problem, I couldn't find a similar enough question. I'm getting closer to a solution and will post it when (and if) it works, but I feel there should be a more elegant one.

B0Andrew asked Dec 11 '22 23:12


2 Answers

Here's an answer without any fancy joins. It uses only GROUP BY and ROW_NUMBER, which is simpler and should also be more efficient, since it needs no self-join.

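-- `data` is the sample data set from the question
-- (a table, or a CTE listed before this one in the same WITH clause)
WITH CTE_dayOfYear
AS
(
    SELECT  id,
            AddedOn,
            -- day number relative to a fixed anchor date; missing dates leave gaps
            DATEDIFF(DAY,'20000101',AddedOn) dyID,
            -- gap-free sequence in the same order
            ROW_NUMBER() OVER (ORDER BY ID,AddedOn) row_num
    FROM data
)

SELECT  ID,
        MIN(AddedOn) StartDate,
        MAX(AddedOn) EndDate,
        dyID-row_num AS groupID -- constant within each run of consecutive days
FROM CTE_dayOfYear
GROUP BY ID,dyID - row_num
ORDER BY ID,2,3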

The logic is that dyID is derived from the date, so it has gaps wherever dates are missing, while row_num has no gaps. Within a run of consecutive days both values increase by 1 per row, so dyID - row_num stays constant. Every gap in dyID changes that difference, so I simply use it as my groupID.
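
The same idea can be sketched outside SQL. The Python below is only an illustration, not part of the answer: `islands` and the sample rows are hypothetical names, and `date.toordinal()` stands in for `DATEDIFF(DAY, '20000101', AddedOn)` as the day number.

```python
from datetime import date
from itertools import groupby

def islands(rows):
    """Group (id, date) rows into contiguous day intervals per id."""
    # Same ordering as ROW_NUMBER() OVER (ORDER BY id, AddedOn)
    ordered = sorted(rows)
    keyed = [
        # day number minus row number is constant within a run of consecutive days
        (rec_id, day.toordinal() - row_num, day)
        for row_num, (rec_id, day) in enumerate(ordered, start=1)
    ]
    result = []
    # Rows with the same (id, difference) are adjacent, so groupby is enough
    for (rec_id, _group), run in groupby(keyed, key=lambda k: (k[0], k[1])):
        days = [d for _, _, d in run]
        result.append((rec_id, min(days), max(days)))
    return result

rows = [(1, date(2015, 1, d)) for d in (1, 2, 4, 5, 6)]
for rec_id, start, end in islands(rows):
    print(rec_id, start, end)
```

For id 1 in the sample data this yields the two intervals from the expected result: 2015-01-01 to 2015-01-02 and 2015-01-04 to 2015-01-06.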

Stephan answered Dec 21 '22 19:12


In SQL Server 2008 this is a bit painful, because the LEAD and LAG functions are not available:

WITH    data
          AS ( SELECT   * ,
                        ROW_NUMBER() OVER ( ORDER BY id, AddedOn ) AS rn
               FROM     ( VALUES ( 0, GETDATE()), --dummy record used to infer column types
                        ( 1, '20150101'), ( 1, '20150102'), ( 1, '20150104'),
                        ( 1, '20150105'), ( 1, '20150106'), ( 2, '20150101'),
                        ( 2, '20150102'), ( 2, '20150103'), ( 2, '20150104'),
                        ( 2, '20150106'), ( 2, '20150107'), ( 3, '20150101'),
                        ( 3, '20150103'), ( 3, '20150105'), ( 3, '20150106'),
                        ( 3, '20150108'), ( 3, '20150109'), ( 3, '20150110') )
                        AS d ( id, AddedOn )
               WHERE    id > 0 -- exclude dummy record
             ),
        diff
          AS ( SELECT   d1.* ,
                        CASE WHEN ISNULL(DATEDIFF(dd, d2.AddedOn, d1.AddedOn),
                                         1) = 1 THEN 0
                             ELSE 1
                        END AS diff
               FROM     data d1
                        LEFT JOIN data d2 ON d1.id = d2.id
                                             AND d1.rn = d2.rn + 1
             ),
        parts
          AS ( SELECT   * ,
                        ( SELECT    SUM(diff)
                          FROM      diff d2
                          WHERE     d2.rn <= d1.rn
                        ) AS p
               FROM     diff d1
             )
    SELECT  id ,
            MIN(AddedOn) AS StartDate ,
            MAX(AddedOn) AS EndDate
    FROM    parts
    GROUP BY id ,
            p
    ORDER BY id ,
            StartDate -- make the row order deterministic

Output:

id  StartDate               EndDate
1   2015-01-01 00:00:00.000 2015-01-02 00:00:00.000
1   2015-01-04 00:00:00.000 2015-01-06 00:00:00.000
2   2015-01-01 00:00:00.000 2015-01-04 00:00:00.000
2   2015-01-06 00:00:00.000 2015-01-07 00:00:00.000
3   2015-01-01 00:00:00.000 2015-01-01 00:00:00.000
3   2015-01-03 00:00:00.000 2015-01-03 00:00:00.000
3   2015-01-05 00:00:00.000 2015-01-06 00:00:00.000
3   2015-01-08 00:00:00.000 2015-01-10 00:00:00.000

Walkthrough:

diff: this CTE returns the following data (shown for id 1):

id  AddedOn                 rn  diff
1   2015-01-01 00:00:00.000 1   0
1   2015-01-02 00:00:00.000 2   0
1   2015-01-04 00:00:00.000 3   1
1   2015-01-05 00:00:00.000 4   0
1   2015-01-06 00:00:00.000 5   0

The table is joined to itself to fetch the previous row for the same id. Then the difference in days between the current row and the previous row is calculated: if it is 1 day (or there is no previous row), the result is 0, otherwise 1.
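
As a rough illustration of this step only, here is a Python sketch of the flagging logic. The function name `gap_flags` and the sample rows are hypothetical; keeping track of the previous row in a loop plays the role of the self-join on `rn`.

```python
from datetime import date

def gap_flags(rows):
    """Return (id, day, flag); flag is 1 when the row follows a gap within its id."""
    flagged = []
    prev = None
    for rec_id, day in sorted(rows):
        same_id = prev is not None and prev[0] == rec_id
        # No previous row for this id behaves like ISNULL(..., 1): flag 0
        if not same_id or (day - prev[1]).days == 1:
            flag = 0
        else:
            flag = 1
        flagged.append((rec_id, day, flag))
        prev = (rec_id, day)
    return flagged

rows = [(1, date(2015, 1, d)) for d in (1, 2, 4, 5, 6)]
for rec_id, day, flag in gap_flags(rows):
    print(rec_id, day, flag)
```

For id 1 the flags come out as 0, 0, 1, 0, 0, matching the diff column above.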

parts: this CTE takes the result of the previous step and computes a cumulative sum of the new column (the sum of all its values from the first row up to the current row), which gives the partitions to group by:

id  AddedOn                 rn  diff p
1   2015-01-01 00:00:00.000 1   0   0
1   2015-01-02 00:00:00.000 2   0   0
1   2015-01-04 00:00:00.000 3   1   1
1   2015-01-05 00:00:00.000 4   0   1
1   2015-01-06 00:00:00.000 5   0   1
2   2015-01-01 00:00:00.000 6   0   1
2   2015-01-02 00:00:00.000 7   0   1
2   2015-01-03 00:00:00.000 8   0   1
2   2015-01-04 00:00:00.000 9   0   1
2   2015-01-06 00:00:00.000 10  1   2
2   2015-01-07 00:00:00.000 11  0   2
3   2015-01-01 00:00:00.000 12  0   2
3   2015-01-03 00:00:00.000 13  1   3

The last step just groups by id and the new column and picks the minimum and maximum dates.
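
The running sum and the final grouping can be sketched in Python as well. This is an illustration under assumptions, not the answer's code: `intervals` is a hypothetical name, and its input is assumed to be rows that already carry the 0/1 gap flag from the diff step, sorted by id and date.

```python
from datetime import date
from itertools import accumulate

def intervals(flagged):
    """flagged: (id, day, flag) rows sorted by id, day; returns (id, start, end)."""
    # Cumulative sum over all rows, like the column p in the parts CTE
    running = list(accumulate(flag for _, _, flag in flagged))
    bounds = {}
    for (rec_id, day, _), p in zip(flagged, running):
        # GROUP BY id, p with MIN(AddedOn) and MAX(AddedOn)
        start, end = bounds.get((rec_id, p), (day, day))
        bounds[(rec_id, p)] = (min(start, day), max(end, day))
    return [(rec_id, s, e) for (rec_id, _), (s, e) in bounds.items()]

flagged = [
    (1, date(2015, 1, 1), 0),
    (1, date(2015, 1, 2), 0),
    (1, date(2015, 1, 4), 1),
    (1, date(2015, 1, 5), 0),
    (1, date(2015, 1, 6), 0),
    (2, date(2015, 1, 1), 0),
    (2, date(2015, 1, 2), 0),
]
for rec_id, start, end in intervals(flagged):
    print(rec_id, start, end)
```

Note that the running sum is not reset per id, just as in the query: the first rows of id 2 share p = 1 with the end of id 1, and grouping by the pair (id, p) keeps them apart.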

Giorgi Nakeuri answered Dec 21 '22 20:12