Flattening date intervals in SQL

I have a database table where there are three columns that are essential to this question:

  • A group ID, that groups rows together
  • A start date
  • An end date

I want to make a view from this table in which overlapping date intervals that have the same group ID are flattened into a single interval.

Date intervals that are not overlapping shall not be flattened.

Example:

Group ID       Start         End
   1        2016-01-01   2017-12-31
   1        2016-06-01   2020-01-01
   1        2022-08-31   2030-12-31
   2        2010-03-01   2017-01-01
   2        2012-01-01   2013-12-31
   3        2001-01-01   9999-13-31

...becomes...

Group ID       Start         End
   1        2016-01-01   2020-01-01
   1        2022-08-31   2030-12-31
   2        2010-03-01   2017-01-01
   3        2001-01-01   9999-12-31

Overlapping intervals may overlap in any way: one may be completely enclosed by another, they may be staggered, or they may even share the same start and/or end date.

Few group IDs repeat. Usually (> 95%) there is only one row with a particular group ID. About a thousand IDs show up in two rows, a handful of IDs exist in three rows, and none appear in four rows or more.

But I need to be prepared for group IDs that appear in four or more rows.

How can I write an SQL statement that creates a view that shows the table flattened this way?

Do note that every row also has a unique ID. This does not need to be preserved in any way, but in case it helps when writing the SQL, I am letting you know.

MichaelK asked Oct 11 '16 08:10


4 Answers

First, find the intervals that do not continue an overlapping sequence, i.e. those that start a new flattened interval:

select * 
from dateclap d1
where not exists(
    select * 
    from dateclap d2 
    where d2.group_id=d1.group_id and 
        d2.end_date >= d1.start_date and 
        (d2.start_date < d1.start_date or 
        (d1.start_date=d2.start_date and d2.r_id<d1.r_id)))

The last line breaks ties between intervals that start at the same date/time by ordering them by the unique record id (r_id).

Then, starting from each such record, a hierarchical query collects the records that overlap it, with connect_by_root r_id distinguishing the clamp groups. After that, all we need is the min/max for each clamp group (connect_by_root r_id is the id of the root record of the group):

select group_id, min(start_date) as start_date, max(end_date) as end_date
from dateclap d1
start with not exists(
    select * 
    from dateclap d2 
    where d2.group_id=d1.group_id and 
        d2.end_date >= d1.start_date and 
        (d2.start_date < d1.start_date or 
        (d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
connect by nocycle
    prior group_id=group_id and 
    start_date between prior start_date and prior end_date
group by group_id, connect_by_root r_id

Note the nocycle here: it is a dirty trick to avoid exceptions, because the connect condition is loose and in fact tries to connect each record to itself. You can tighten the condition after "connect by", along the lines of the "exists" condition, to avoid using nocycle.
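As a rough, untested sketch of that refinement: the extra predicate below only lets a record connect to one that comes strictly later when ordered by (start_date, r_id), so no record can reach itself and nocycle should no longer be needed. The tie-break on r_id mirrors the "exists" condition; everything else is the query above.

```sql
-- Sketch: same query without NOCYCLE. Each child must come strictly after
-- its parent in (start_date, r_id) order, so a path can never loop back.
select group_id, min(start_date) as start_date, max(end_date) as end_date
from dateclap d1
start with not exists(
    select *
    from dateclap d2
    where d2.group_id=d1.group_id and
        d2.end_date >= d1.start_date and
        (d2.start_date < d1.start_date or
        (d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
connect by
    prior group_id=group_id and
    start_date between prior start_date and prior end_date and
    (start_date > prior start_date or
    (start_date = prior start_date and r_id > prior r_id))
group by group_id, connect_by_root r_id
```

A record may still be reached along several paths from the same root, which only produces duplicate rows before the min/max aggregation and does not change the result.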

P.S. Table was created for tests like this:

CREATE TABLE "ANIKIN"."DATECLAP" 
(   
    "R_ID" NUMBER, 
    "GROUP_ID" NUMBER, 
    "START_DATE" DATE, 
    "END_DATE" DATE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "ANIKIN" ;

A unique key (or more likely a primary key) on r_id and the corresponding sequence/triggers are not specific to these tests; just populate r_id with unique values.
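As a minimal sketch, the example rows from the question could be loaded like this. The r_id values are arbitrary (they only need to be unique), the table is assumed to be in the current schema, and the last end date is entered as 9999-12-31 on the assumption that 9999-13-31 in the question is a typo.

```sql
-- Sample rows from the question; r_id values are arbitrary but unique.
insert into dateclap (r_id, group_id, start_date, end_date)
values (1, 1, date '2016-01-01', date '2017-12-31');
insert into dateclap (r_id, group_id, start_date, end_date)
values (2, 1, date '2016-06-01', date '2020-01-01');
insert into dateclap (r_id, group_id, start_date, end_date)
values (3, 1, date '2022-08-31', date '2030-12-31');
insert into dateclap (r_id, group_id, start_date, end_date)
values (4, 2, date '2010-03-01', date '2017-01-01');
insert into dateclap (r_id, group_id, start_date, end_date)
values (5, 2, date '2012-01-01', date '2013-12-31');
insert into dateclap (r_id, group_id, start_date, end_date)
values (6, 3, date '2001-01-01', date '9999-12-31');
commit;
```

With these rows, the hierarchical query should return the four flattened intervals listed in the question.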

Alexander Anikin answered Sep 26 '22 08:09


select t1.group_id,
       least(min(t1.start_date), min(t2.start_date)) as start_date,
       greatest(max(t1.end_date), max(t2.end_date)) as end_date
from test_interval t1, test_interval t2
where (t1.start_date, t1.end_date) overlaps (t2.start_date, t2.end_date)
  and t1.rowid <> t2.rowid
  and t1.group_id = t2.group_id
group by t1.group_id;

This query produces a list of overlapping intervals for me. OVERLAPS is an undocumented operator. I do wonder whether it returns a wrong result when a group contains two pairs of intervals that overlap within each pair but not with each other. Where I used rowid, you can use your unique row identifier.
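To make that doubtful case concrete, here is a small sketch with four hypothetical rows in one group, forming two separate overlapping pairs. It assumes test_interval has the columns used in the query (group_id, start_date, end_date). Because the query groups by group_id alone, all four rows would end up in the same group.

```sql
-- Hypothetical test rows: two overlapping pairs separated by a gap.
insert into test_interval (group_id, start_date, end_date)
values (1, date '2016-01-01', date '2016-03-01');
insert into test_interval (group_id, start_date, end_date)
values (1, date '2016-02-01', date '2016-04-01');
insert into test_interval (group_id, start_date, end_date)
values (1, date '2018-01-01', date '2018-03-01');
insert into test_interval (group_id, start_date, end_date)
values (1, date '2018-02-01', date '2018-04-01');
commit;

-- Desired result: two rows, 2016-01-01 .. 2016-04-01 and 2018-01-01 .. 2018-04-01.
-- Grouping by group_id only would give one row, 2016-01-01 .. 2018-04-01,
-- which also covers the gap between the two pairs.
```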

Kacper answered Sep 25 '22 08:09


Create 2 functions that return the flattened start- and end-date for a specific element:

CREATE OR REPLACE FUNCTION getMinStartDate
(
  p_group_id IN NUMBER,
  p_start    IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MIN(start_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_start
     AND end_date >= p_start;
  RETURN v_result;
END getMinStartDate;

CREATE OR REPLACE FUNCTION getMaxEndDate
(
  p_group_id IN NUMBER,
  p_end      IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MAX(end_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_end
     AND end_date >= p_end;
  RETURN v_result;
END getMaxEndDate;

Your view should then return these flattened dates for each row.
Use DISTINCT, of course, since several rows may yield the same dates:

SELECT DISTINCT
       group_id,
       getMinStartDate(group_id, start_date) AS start_date,
       getMaxEndDate(group_id, end_date) AS end_date
FROM   my_data;
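
Wrapped as the view the question asks for, this might look like the sketch below. The view name my_data_flattened is hypothetical; the functions and the my_data table are the ones defined above.

```sql
-- my_data_flattened is a hypothetical view name.
CREATE OR REPLACE VIEW my_data_flattened AS
SELECT DISTINCT
       group_id,
       getMinStartDate(group_id, start_date) AS start_date,
       getMaxEndDate(group_id, end_date) AS end_date
FROM   my_data;
```

Note that each function only looks at intervals that directly contain the given date, so a chain of three or more staggered intervals may not collapse into a single row.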

Robert Kock answered Sep 25 '22 08:09


The input data shows an end date of 9999-13-31 in the last row. That should be corrected.

With that said, it is best to choose a made-up end date that is not exactly 9999-12-31. In many problems one needs to add a day, or a couple of weeks, or whatever, to all the dates in a table; but if one tries to add to 9999-12-31, that will fail. I prefer 8999-12-31; one thousand years should be enough for most computations. {:-) In the test data I created for my query I used this convention. (The solution can be easily adapted for 9999-12-31 though.)

When working with datetime intervals, remember that a pure date means midnight at the beginning of a day. So the year 2016 has the "end date" 2017-01-01 (midnight at the beginning of the day) and the year 2017 has the "start date" 2017-01-01 also. So the table SHOULD have the same end-date and start-date for periods that immediately follow each other - and they should be fused together into a single interval. However, an interval ending on 2016-08-31 and one that begins on 2016-09-01 should NOT be fused together; they are separated by a full day (specifically the day of 2016-08-31 is NOT included in either interval).

The OP did not specify how the end dates are meant to be interpreted here. I assume they are as described in the previous paragraph; otherwise the solution can be easily adapted (but it will require adding 1 to the end dates first, and then subtracting 1 at the end, which is exactly one of those cases where 9999-12-31 is not a good placeholder for "unknown").

Solution:

with m as
        (
         select group_id, start_date,
                   max(end_date) over (partition by group_id order by start_date 
                             rows between unbounded preceding and 1 preceding) as m_time
         from inputs   -- "inputs" is the name of the base table
         union all
         select group_id, NULL, max(end_date) from inputs group by group_id
        ),
     n as
        (
         select group_id, start_date, m_time 
         from m 
         where start_date > m_time or start_date is null or m_time is null
        ),
     f as
        (
         select group_id, start_date,
            lead(m_time) over (partition by group_id order by start_date) as end_date
         from n
        )
select * from f where start_date is not null
;

Output (with the data provided):

  GROUP_ID START_DATE END_DATE 
---------- ---------- ----------
         1 2016-01-01 2020-01-01
         1 2022-08-31 2030-12-31
         2 2010-03-01 2017-01-01
         3 2001-01-01 8999-12-31
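
If the end dates are instead meant to be inclusive (the last day that belongs to the interval), the adaptation mentioned earlier might look like the following untested sketch. It is the same query with 1 added to every end date on the way in and subtracted on the way out, and it assumes no end date is 9999-12-31, since adding a day to that date fails.

```sql
-- Sketch: same approach, but treating end_date as the last day INCLUDED
-- in the interval. Assumes no end_date is 9999-12-31 (adding 1 would fail).
with m as
        (
         select group_id, start_date,
                max(end_date + 1) over (partition by group_id order by start_date
                          rows between unbounded preceding and 1 preceding) as m_time
         from inputs
         union all
         select group_id, NULL, max(end_date + 1) from inputs group by group_id
        ),
     n as
        (
         select group_id, start_date, m_time
         from m
         where start_date > m_time or start_date is null or m_time is null
        ),
     f as
        (
         select group_id, start_date,
                lead(m_time) over (partition by group_id order by start_date) as end_date
         from n
        )
select group_id, start_date, end_date - 1 as end_date   -- back to inclusive end dates
from f
where start_date is not null
;
```

Under this reading, an interval ending on 2016-08-31 and one starting on 2016-09-01 are fused, because the shifted end date equals the next start date.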

mathguy answered Sep 24 '22 08:09