Flattening date intervals in SQL

Tags:

sql

oracle

oracle11g

I have a database table where there are three columns that are essential to this question:

A group ID, that groups rows together
A start date
An end date

I want to make a view from this table so that overlapping date intervals, that have the same grouping ID, are flattened.

Date intervals that are not overlapping shall not be flattened.

Example:

Group ID       Start         End
   1        2016-01-01   2017-12-31
   1        2016-06-01   2020-01-01
   1        2022-08-31   2030-12-31
   2        2010-03-01   2017-01-01
   2        2012-01-01   2013-12-31
   3        2001-01-01   9999-13-31

...becomes...

Group ID       Start         End
   1        2016-01-01   2020-01-01
   1        2022-08-31   2030-12-31
   2        2010-03-01   2017-01-01
   3        2001-01-01   9999-12-31

Intervals that overlap may do so in any way, completely enclosed by other intervals, or they may be staggered, or they may even have the same start and/or end dates.

Few group IDs are repeated. Commonly (> 95%) there is only one row with a particular group ID. There are about a thousand IDs that show up in two rows, a handful of IDs that exist in three rows, and none in four rows or more.

However, I need to be prepared for group IDs that appear in four or more rows.

How can I write an SQL statement that creates a view that shows the table flattened this way?

Do note that every row also has a unique ID. This does not need to be preserved in any way, but in case it helps when writing the SQL, I am letting you know.


asked Oct 11 '16 08:10

MichaelK

4 Answers

First, find the intervals that do not continue an overlapping sequence, i.e. those that start a new chain:

select * 
from dateclap d1
where not exists(
    select * 
    from dateclap d2 
    where d2.group_id=d1.group_id and 
        d2.end_date >= d1.start_date and 
        (d2.start_date < d1.start_date or 
        (d1.start_date=d2.start_date and d2.r_id<d1.r_id)))

The last line breaks ties between intervals that start at the same date/time by ordering them on the unique record id (r_id).
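To make the "not exists" condition easier to follow, here is a minimal sketch of the same test in Python. The rows are made up for illustration (group 2 is altered so that two rows share a start date), and is_chain_start is a hypothetical helper, not part of the answer's SQL.

```python
from datetime import date

# Hypothetical rows shaped like dateclap: (r_id, group_id, start_date, end_date).
rows = [
    (1, 1, date(2016, 1, 1), date(2017, 12, 31)),
    (2, 1, date(2016, 6, 1), date(2020, 1, 1)),
    (3, 1, date(2022, 8, 31), date(2030, 12, 31)),
    (4, 2, date(2010, 3, 1), date(2017, 1, 1)),
    (5, 2, date(2010, 3, 1), date(2013, 12, 31)),  # same start as r_id 4
]

def is_chain_start(d1, rows):
    """True when no other row of the group reaches d1's start from the left."""
    r1, g1, s1, _ = d1
    for r2, g2, s2, e2 in rows:
        # Same predicate as the subquery: d2 covers d1's start and either
        # starts earlier, or starts at the same time with a smaller r_id.
        if g2 == g1 and e2 >= s1 and (s2 < s1 or (s2 == s1 and r2 < r1)):
            return False
    return True

chain_starts = [r[0] for r in rows if is_chain_start(r, rows)]
print(chain_starts)  # [1, 3, 4]
```

Rows 4 and 5 start on the same date, and only the one with the smaller r_id is kept as a chain start, which is what the tie-break is for.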

Then, starting from each such record, a hierarchical query collects the records chained to it, with connect_by_root r_id identifying each clamp group (a set of chained overlapping intervals). After that, all we need is the min/max for each clamp group (connect_by_root r_id is the id of the root record of the group):

select group_id, min(start_date) as start_date, max(end_date) as end_date
from dateclap d1
start with not exists(
    select * 
    from dateclap d2 
    where d2.group_id=d1.group_id and 
        d2.end_date >= d1.start_date and 
        (d2.start_date < d1.start_date or 
        (d1.start_date=d2.start_date and d2.r_id<d1.r_id)))
connect by nocycle
    prior group_id=group_id and 
    start_date between prior start_date and prior end_date
group by group_id, connect_by_root r_id

Note the nocycle here: it is a dirty trick to avoid exceptions, because the connect condition is loose and in fact tries to connect each record to itself. You can tighten the condition after "connect by", along the lines of the "exists" condition, to avoid using nocycle.
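As a rough model of what the hierarchical query does, the following Python sketch walks from each chain start to every row whose start date falls inside the current row's interval, then takes min/max per root. It is only an illustration under assumed data (the question's sample, with the last end date corrected); is_root, flatten and the seen set are hypothetical names, with the seen set standing in for nocycle.

```python
from datetime import date

rows = [  # (r_id, group_id, start_date, end_date)
    (1, 1, date(2016, 1, 1), date(2017, 12, 31)),
    (2, 1, date(2016, 6, 1), date(2020, 1, 1)),
    (3, 1, date(2022, 8, 31), date(2030, 12, 31)),
    (4, 2, date(2010, 3, 1), date(2017, 1, 1)),
    (5, 2, date(2012, 1, 1), date(2013, 12, 31)),
    (6, 3, date(2001, 1, 1), date(9999, 12, 31)),
]

def is_root(d1):
    """Mirror of the 'start with not exists (...)' condition."""
    r1, g1, s1, _ = d1
    return not any(
        g2 == g1 and e2 >= s1 and (s2 < s1 or (s2 == s1 and r2 < r1))
        for r2, g2, s2, e2 in rows
    )

def flatten(rows):
    result = []
    for root in filter(is_root, rows):
        seen = {root[0]}              # plays the role of nocycle
        stack = [root]
        lo, hi = root[2], root[3]
        while stack:
            _, g, s, e = stack.pop()
            for child in rows:
                cid, cg, cs, ce = child
                # 'connect by': same group, child starts inside the prior row
                if cid not in seen and cg == g and s <= cs <= e:
                    seen.add(cid)
                    stack.append(child)
                    lo, hi = min(lo, cs), max(hi, ce)
        result.append((root[1], lo, hi))
    return sorted(result)

flat = flatten(rows)
for line in flat:
    print(line)
```

Without the seen check, every row would qualify as its own child, because its start date always lies between its own start and end dates. That is the self-connection mentioned above.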

P.S. The test table was created like this:

CREATE TABLE "ANIKIN"."DATECLAP" 
(   
    "R_ID" NUMBER, 
    "GROUP_ID" NUMBER, 
    "START_DATE" DATE, 
    "END_DATE" DATE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "ANIKIN" ;

A unique key (or more likely a primary key) on r_id and the corresponding sequence/triggers are not specific to the tests; just populate r_id with unique values.

answered Sep 26 '22 08:09

Alexander Anikin

   select t1.group_id,
          least(min(t1.start_date), min(t2.start_date)) as start_date,
          greatest(max(t1.end_date), max(t2.end_date)) as end_date
   from test_interval t1, test_interval t2
   where (t1.start_date, t1.end_date) overlaps (t2.start_date, t2.end_date)
      and t1.rowid <> t2.rowid
      and t1.group_id = t2.group_id
   group by t1.group_id;

This query gives me the list of overlapping intervals. OVERLAPS is an undocumented operator. My only concern is whether it returns a wrong result when there are two pairs of intervals that overlap within each pair but not across the pairs. Where I used rowid, you can use your unique row identifier.
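The concern about two separate overlapping pairs can be checked with a small Python model of the self-join. This is only a sketch: the data is made up, integers stand in for dates, and the overlaps function assumes that intervals which merely touch at an endpoint do not overlap, which may differ from how Oracle's OVERLAPS behaves.

```python
from itertools import permutations

# Hypothetical rows: (group_id, start, end).
rows = [
    (1, 1, 3), (1, 2, 4),      # first overlapping pair
    (1, 10, 12), (1, 11, 13),  # second pair, far away from the first
    (2, 5, 6),                 # lone interval in another group
]

def overlaps(a, b):
    # Assumption: touching endpoints do not count as an overlap.
    return a[1] < b[2] and b[1] < a[2]

by_group = {}
for t1, t2 in permutations(rows, 2):   # ordered pairs of distinct rows
    if t1[0] == t2[0] and overlaps(t1, t2):
        g = t1[0]
        lo, hi = by_group.get(g, (t1[1], t1[2]))
        # 'group by group_id' folds every matching pair into one row
        by_group[g] = (min(lo, t1[1], t2[1]), max(hi, t1[2], t2[2]))

print(by_group)  # {1: (1, 13)}
```

In this model, the two pairs in group 1 collapse into a single interval (1, 13) although nothing covers the range between 4 and 10, and the lone interval in group 2 disappears because it has no partner in the join.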

answered Sep 25 '22 08:09

Kacper

Create 2 functions that return the flattened start- and end-date for a specific element:

CREATE OR REPLACE FUNCTION getMinStartDate
(
  p_group_id IN NUMBER,
  p_start    IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MIN(start_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_start
     AND end_date >= p_start;
  RETURN v_result;
END getMinStartDate;

CREATE OR REPLACE FUNCTION getMaxEndDate
(
  p_group_id IN NUMBER,
  p_end      IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MAX(end_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_end
     AND end_date >= p_end;
  RETURN v_result;
END getMaxEndDate;

Your view should then return these flattened dates for each element.
Use DISTINCT, of course, since several elements may result in the same dates:

SELECT DISTINCT
       group_id,
       getMinStartDate(group_id, start_date) AS start_date,
       getMaxEndDate(group_id, end_date) AS end_date
FROM   my_data;
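To see how the two lookups behave on a chain of staggered intervals, here is a small Python model of the functions and the view. It is a sketch with made-up data (integers stand in for dates); get_min_start_date and get_max_end_date are hypothetical Python counterparts of the PL/SQL functions.

```python
# Hypothetical staggered chain in one group: (group_id, start, end).
my_data = [(1, 1, 5), (1, 4, 8), (1, 7, 10)]

def get_min_start_date(group_id, p_start):
    # Earliest start among rows of the group that contain p_start.
    return min(s for g, s, e in my_data if g == group_id and s <= p_start <= e)

def get_max_end_date(group_id, p_end):
    # Latest end among rows of the group that contain p_end.
    return max(e for g, s, e in my_data if g == group_id and s <= p_end <= e)

# SELECT DISTINCT ... modelled as a set of tuples.
view = sorted({(g, get_min_start_date(g, s), get_max_end_date(g, e))
               for g, s, e in my_data})
print(view)  # [(1, 1, 8), (1, 1, 10), (1, 4, 10)]
```

Each lookup only sees rows that directly contain the given date, so in this model a chain of three staggered intervals yields three partially merged rows instead of the single row (1, 1, 10). Groups of one or two rows are not affected.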

answered Sep 25 '22 08:09

Robert Kock

The input data shows an end date of 9999-13-31 in the last row. That should be corrected.

With that said, it is best to choose a made-up end date that is not exactly 9999-12-31. In many problems one needs to add a day, or a couple of weeks, or whatever, to all the dates in a table; but if one tries to add to 9999-12-31, that will fail. I prefer 8999-12-31; one thousand years should be enough for most computations. {:-) In the test data I created for my query I used this convention. (The solution can be easily adapted for 9999-12-31 though.)
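The same kind of overflow can be reproduced outside the database. As an analogy only (this is Python's date type, not Oracle's DATE), the sketch below adds to 9999-12-31, which happens to be Python's upper bound too, and then to the 8999-12-31 placeholder.

```python
from datetime import date, timedelta

far_future = date(9999, 12, 31)   # also the largest date Python can represent
try:
    far_future + timedelta(days=1)
    overflowed = False
except OverflowError:
    overflowed = True

# The 8999-12-31 placeholder leaves plenty of room for date arithmetic.
safe_placeholder = date(8999, 12, 31) + timedelta(days=14)
print(overflowed, safe_placeholder)  # True 9000-01-14
```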

When working with datetime intervals, remember that a pure date means midnight at the beginning of a day. So the year 2016 has the "end date" 2017-01-01 (midnight at the beginning of the day) and the year 2017 has the "start date" 2017-01-01 also. So the table SHOULD have the same end-date and start-date for periods that immediately follow each other - and they should be fused together into a single interval. However, an interval ending on 2016-08-31 and one that begins on 2016-09-01 should NOT be fused together; they are separated by a full day (specifically the day of 2016-08-31 is NOT included in either interval).
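The two cases from this paragraph can be written as a one-line test. A minimal sketch in Python, where should_fuse is a hypothetical helper that treats the end date as the midnight at which the interval stops:

```python
from datetime import date

def should_fuse(first_end, next_start):
    """Fuse when the next interval starts at or before the previous end."""
    return next_start <= first_end

# Year 2016 ends at 2017-01-01 and year 2017 starts at 2017-01-01: fused.
print(should_fuse(date(2017, 1, 1), date(2017, 1, 1)))   # True
# Ends 2016-08-31, next begins 2016-09-01: a full day apart, not fused.
print(should_fuse(date(2016, 8, 31), date(2016, 9, 1)))  # False
```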

The OP did not specify how the end-dates are meant to be interpreted here. I assume they are as described in the previous paragraph; otherwise the solution can be easily adapted (but it will require adding 1 to end dates first, and then subtracting 1 at the end; this is exactly one of those cases when 9999-12-31 is not a good placeholder for "unknown").
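If the end dates are instead meant as the last day included, the add-one/subtract-one adaptation could look roughly like this in Python. It is a sketch, not the SQL itself; merge_inclusive and the sample intervals are made up for illustration.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def merge_inclusive(intervals):
    """Merge (start, end) pairs whose end date is the last day included."""
    merged = []
    for start, end in sorted(intervals):
        end_excl = end + ONE_DAY      # this step overflows for 9999-12-31
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end_excl)
        else:
            merged.append([start, end_excl])
    # Convert back to inclusive end dates.
    return [(s, e - ONE_DAY) for s, e in merged]

result = merge_inclusive([
    (date(2016, 1, 1), date(2016, 8, 31)),
    (date(2016, 9, 1), date(2016, 12, 31)),
])
print(result)
```

Under this interpretation the two sample intervals are adjacent days, so they come out as one interval from 2016-01-01 to 2016-12-31.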

Solution:

with m as (
    select group_id, start_date,
           max(end_date) over (partition by group_id order by start_date
                               rows between unbounded preceding and 1 preceding) as m_time
    from inputs   -- "inputs" is the name of the base table
    union all
    select group_id, NULL, max(end_date) from inputs group by group_id
),
n as (
    select group_id, start_date, m_time
    from m
    where start_date > m_time or start_date is null or m_time is null
),
f as (
    select group_id, start_date,
           lead(m_time) over (partition by group_id order by start_date) as end_date
    from n
)
select * from f where start_date is not null
;

Output (with the data provided):

  GROUP_ID START_DATE END_DATE 
---------- ---------- ----------
         1 2016-01-01 2020-01-01
         1 2022-08-31 2030-12-31
         2 2010-03-01 2017-01-01
         3 2001-01-01 8999-12-31
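For readers who want to check the logic outside the database, the running-maximum idea can be sketched in Python: sort each group by start date, keep the largest end date seen so far (the role m_time plays in the query), and open a new interval whenever a start date is later than that maximum. This is an illustration, not a translation of the SQL; flatten is a hypothetical name and the data is the question's sample with the 8999-12-31 placeholder.

```python
from datetime import date

inputs = [  # (group_id, start_date, end_date)
    (1, date(2016, 1, 1), date(2017, 12, 31)),
    (1, date(2016, 6, 1), date(2020, 1, 1)),
    (1, date(2022, 8, 31), date(2030, 12, 31)),
    (2, date(2010, 3, 1), date(2017, 1, 1)),
    (2, date(2012, 1, 1), date(2013, 12, 31)),
    (3, date(2001, 1, 1), date(8999, 12, 31)),
]

def flatten(rows):
    out = []
    for g, s, e in sorted(rows):                  # by group, then start date
        if out and out[-1][0] == g and s <= out[-1][2]:
            out[-1][2] = max(out[-1][2], e)       # extend the running maximum
        else:
            out.append([g, s, e])                 # start > running max: new interval
    return [tuple(r) for r in out]

flat = flatten(inputs)
for g, s, e in flat:
    print(g, s, e)
```

The printed rows match the four rows of the output shown above.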

answered Sep 24 '22 08:09

mathguy
