I have a database table with three columns that are essential to this question: a group ID, a start date, and an end date.
I want to make a view from this table so that overlapping date intervals, that have the same grouping ID, are flattened.
Date intervals that are not overlapping shall not be flattened.
Example:
Group ID  Start       End
1         2016-01-01  2017-12-31
1         2016-06-01  2020-01-01
1         2022-08-31  2030-12-31
2         2010-03-01  2017-01-01
2         2012-01-01  2013-12-31
3         2001-01-01  9999-13-31
...becomes...
Group ID  Start       End
1         2016-01-01  2020-01-01
1         2022-08-31  2030-12-31
2         2010-03-01  2017-01-01
3         2001-01-01  9999-12-31
Intervals may overlap in any way: one may be completely enclosed by another, they may be staggered, or they may even share the same start and/or end dates.
Few group IDs repeat. Most commonly (> 95%) there is only one row with a particular group ID. About a thousand IDs show up in two rows, a handful of IDs exist in three rows, and none are in four rows or more.
But I need to be prepared for group IDs that appear in four or more rows.
How can I write an SQL statement that creates a view that shows the table flattened this way?
Do note that every row also has a unique ID. This does not need to be preserved in any way, but in case it helps when writing the SQL, I am letting you know.
First, find the intervals that are not a continuation of an overlapping sequence, i.e. those that start a new sequence:
select *
  from dateclap d1
 where not exists (
         select *
           from dateclap d2
          where d2.group_id = d1.group_id
            and d2.end_date >= d1.start_date
            and (d2.start_date < d1.start_date
                 or (d1.start_date = d2.start_date and d2.r_id < d1.r_id)))
The last line distinguishes intervals that start at the same date/time by ordering them by the unique record id (r_id).
Then, starting from each such record, a hierarchical query collects the records that overlap it, with connect_by_root r_id identifying the clamp group (it is the id of the root record of the group). After that, all we need is the min/max for each clamp group:
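To make the "not exists" condition easier to follow, here is a minimal Python sketch of the same predicate. It is not Oracle SQL, only a restatement of the logic; the row layout (r_id, group_id, start_date, end_date), the r_id values, and the function name is_chain_start are assumptions made for illustration, with the dates taken from the question's example.

```python
from datetime import date

# Assumed row layout: (r_id, group_id, start_date, end_date).
rows = [
    (1, 1, date(2016, 1, 1), date(2017, 12, 31)),
    (2, 1, date(2016, 6, 1), date(2020, 1, 1)),
    (3, 1, date(2022, 8, 31), date(2030, 12, 31)),
    (4, 2, date(2010, 3, 1), date(2017, 1, 1)),
    (5, 2, date(2012, 1, 1), date(2013, 12, 31)),
]

def is_chain_start(d1, rows):
    """True when no other row of the same group reaches d1's start from the left."""
    r1, g1, s1, _ = d1
    for r2, g2, s2, e2 in rows:
        if g2 != g1 or e2 < s1:
            continue  # different group, or d2 ends before d1 starts
        if s2 < s1 or (s2 == s1 and r2 < r1):
            return False  # d2 starts earlier, or ties and has the lower r_id
    return True

chain_starts = [r[0] for r in rows if is_chain_start(r, rows)]
print(chain_starts)  # [1, 3, 4]
```

Rows 2 and 5 are rejected because an earlier row of the same group is still open when they start, which is what the SQL subquery checks.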
select group_id, min(start_date) as start_date, max(end_date) as end_date
  from dateclap d1
 start with not exists (
         select *
           from dateclap d2
          where d2.group_id = d1.group_id
            and d2.end_date >= d1.start_date
            and (d2.start_date < d1.start_date
                 or (d1.start_date = d2.start_date and d2.r_id < d1.r_id)))
 connect by nocycle
         prior group_id = group_id
     and start_date between prior start_date and prior end_date
 group by group_id, connect_by_root r_id
Note the nocycle here: it is a dirty trick to avoid exceptions, because the connection condition is weak and in fact tries to connect a record to itself. You can refine the condition after "connect by", similarly to the "exists" condition, to avoid using nocycle.
P.S. The table for these tests was created like this:
CREATE TABLE "ANIKIN"."DATECLAP"
(
"R_ID" NUMBER,
"GROUP_ID" NUMBER,
"START_DATE" DATE,
"END_DATE" DATE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "ANIKIN" ;
A unique key (or probably a primary key) on r_id and the corresponding sequence/triggers are not specific to the tests; just populate r_id with unique values.
select t1.group_id,
       least(min(t1.start_date), min(t2.start_date)),
       greatest(max(t1.end_date), max(t2.end_date))
  from test_interval t1, test_interval t2
 where (t1.start_date, t1.end_date) overlaps (t2.start_date, t2.end_date)
   and t1.rowid <> t2.rowid
   and t1.group_id = t2.group_id
 group by t1.group_id;
This query gives me the list of overlapping intervals. OVERLAPS is an undocumented operator. I only wonder whether it returns a wrong result when there are two pairs of intervals that overlap within each pair but not with each other. Where I used rowid, you can use your unique row identifier.
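The doubt about two separate overlapping pairs can be checked with a small Python sketch that mimics the join and the group by. It is not Oracle SQL: the data is hypothetical, and the overlaps helper is only an assumed stand-in for the undocumented OVERLAPS operator (two intervals count as overlapping when they share some time).

```python
from datetime import date
from itertools import permutations

# Hypothetical data: one group containing two separate overlapping pairs.
rows = [
    (1, date(2016, 1, 1), date(2016, 3, 1)),
    (1, date(2016, 2, 1), date(2016, 4, 1)),
    (1, date(2020, 1, 1), date(2020, 3, 1)),
    (1, date(2020, 2, 1), date(2020, 4, 1)),
]

def overlaps(a, b):
    # Assumed stand-in for OVERLAPS: the two intervals share some time.
    return a[1] < b[2] and b[1] < a[2]

result = {}
for t1, t2 in permutations(rows, 2):  # every ordered pair of distinct rows
    if t1[0] == t2[0] and overlaps(t1, t2):
        lo, hi = result.get(t1[0], (date.max, date.min))
        # least(min(...), min(...)) and greatest(max(...), max(...)) per group_id
        result[t1[0]] = (min(lo, t1[1], t2[1]), max(hi, t1[2], t2[2]))

print(result)  # one entry for group 1: 2016-01-01 .. 2020-04-01
```

Under these assumptions the group collapses into a single interval that also covers the gap between 2016-04-01 and 2020-01-01, because the grouping is by group_id only. A row with no overlapping partner never enters the join at all, so it would have to be added back separately.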
Create two functions that return the flattened start and end date for a specific element:
-- Earliest start date among the rows of the group that contain p_start.
CREATE OR REPLACE FUNCTION getMinStartDate
(
  p_group_id IN NUMBER,
  p_start    IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MIN(start_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_start
     AND end_date >= p_start;
  RETURN v_result;
END getMinStartDate;
/
-- Latest end date among the rows of the group that contain p_end.
CREATE OR REPLACE FUNCTION getMaxEndDate
(
  p_group_id IN NUMBER,
  p_end      IN DATE
)
RETURN DATE AS
  v_result DATE;
BEGIN
  SELECT MAX(end_date)
    INTO v_result
    FROM my_data
   WHERE group_id = p_group_id
     AND start_date <= p_end
     AND end_date >= p_end;
  RETURN v_result;
END getMaxEndDate;
/
Your view should then return these flattened dates for each element, using DISTINCT, of course, since several elements may result in the same dates:
SELECT DISTINCT
group_id,
getMinStartDate(group_id, start_date) AS start_date,
getMaxEndDate(group_id, end_date) AS end_date
FROM my_data;
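Since the question asks to be prepared for more than two rows per group, it may be worth checking how this behaves on a chain of three staggered intervals. The Python sketch below is only a stand-in for the two PL/SQL functions and the DISTINCT query; the data is hypothetical and the snake_case function names are assumptions.

```python
from datetime import date

# Hypothetical chain in one group: A overlaps B, B overlaps C, A does not overlap C.
my_data = [
    (1, date(2016, 1, 1), date(2016, 5, 1)),   # A
    (1, date(2016, 4, 1), date(2016, 8, 1)),   # B
    (1, date(2016, 7, 1), date(2016, 10, 1)),  # C
]

def get_min_start_date(group_id, p_start):
    # Earliest start among the rows of the group that contain p_start.
    return min(s for g, s, e in my_data if g == group_id and s <= p_start <= e)

def get_max_end_date(group_id, p_end):
    # Latest end among the rows of the group that contain p_end.
    return max(e for g, s, e in my_data if g == group_id and s <= p_end <= e)

# Equivalent of the SELECT DISTINCT over my_data.
flattened = sorted({(g, get_min_start_date(g, s), get_max_end_date(g, e))
                    for g, s, e in my_data})
for row in flattened:
    print(row)
```

In this sketch the result has three distinct rows (Jan to Aug, Jan to Oct, Apr to Oct) rather than the single interval Jan to Oct, because each function only looks one step beyond the row it is given. With at most two overlapping rows per group the approach gives the expected result.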
The input data shows an end date of 9999-13-31 in the last row. That should be corrected.
With that said, it is best to choose a made-up end date that is not exactly 9999-12-31. In many problems one needs to add a day, or a couple of weeks, or whatever, to all the dates in a table; but if one tries to add to 9999-12-31, that will fail. I prefer 8999-12-31; one thousand years should be enough for most computations. {:-) In the test data I created for my query I used this convention. (The solution can be easily adapted for 9999-12-31 though.)
When working with datetime intervals, remember that a pure date means midnight at the beginning of a day. So the year 2016 has the "end date" 2017-01-01 (midnight at the beginning of the day) and the year 2017 has the "start date" 2017-01-01 also. So the table SHOULD have the same end-date and start-date for periods that immediately follow each other - and they should be fused together into a single interval. However, an interval ending on 2016-08-31 and one that begins on 2016-09-01 should NOT be fused together; they are separated by a full day (specifically the day of 2016-08-31 is NOT included in either interval).
The OP did not specify how the end-dates are meant to be interpreted here. I assume they are as described in the last paragraph; otherwise the solution can be easily adapted (but it will require adding 1 to end dates first, and then subtracting 1 at the end - this is exactly one of those cases when 9999-12-31 is not a good placeholder for "unknown".)
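The "add 1, then subtract 1" adaptation and the trouble with 9999-12-31 can be illustrated in Python, whose date type happens to have the same upper bound as Oracle's DATE. This is only a sketch of the arithmetic, not Oracle code; the helper name to_exclusive is an assumption.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def to_exclusive(end_inclusive):
    """Turn an inclusive end date into an exclusive one by adding a day."""
    return end_inclusive + ONE_DAY

# An interval ending on 2016-08-31 (inclusive) now touches one starting 2016-09-01.
print(to_exclusive(date(2016, 8, 31)))   # 2016-09-01
print(to_exclusive(date(8999, 12, 31)))  # 9000-01-01

try:
    to_exclusive(date(9999, 12, 31))     # already the largest representable date
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```

After flattening, one day would be subtracted from each resulting end date to get back to inclusive ends.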
Solution:
with m as
     (
       select group_id, start_date,
              max(end_date) over (partition by group_id order by start_date
                   rows between unbounded preceding and 1 preceding) as m_time
         from inputs -- "inputs" is the name of the base table
       union all
       select group_id, NULL, max(end_date) from inputs group by group_id
     ),
     n as
     (
       select group_id, start_date, m_time
         from m
        where start_date > m_time or start_date is null or m_time is null
     ),
     f as
     (
       select group_id, start_date,
              lead(m_time) over (partition by group_id order by start_date) as end_date
         from n
     )
select * from f where start_date is not null
;
Output (with the data provided):
GROUP_ID START_DATE END_DATE
---------- ---------- ----------
1 2016-01-01 2020-01-01
1 2022-08-31 2030-12-31
2 2010-03-01 2017-01-01
3 2001-01-01 8999-12-31
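For readers who find the analytic functions hard to follow, the idea behind the query can be sketched procedurally in Python: walk each group in order of start date while tracking the latest end date seen so far (m_time in the query), and begin a new interval only when a start date is strictly later than it. This is a restatement of the logic under those assumptions, not a translation of the SQL; the function name flatten is hypothetical, and the data is the question's example with 8999-12-31 as the placeholder.

```python
from datetime import date
from itertools import groupby

inputs = [
    (1, date(2016, 1, 1), date(2017, 12, 31)),
    (1, date(2016, 6, 1), date(2020, 1, 1)),
    (1, date(2022, 8, 31), date(2030, 12, 31)),
    (2, date(2010, 3, 1), date(2017, 1, 1)),
    (2, date(2012, 1, 1), date(2013, 12, 31)),
    (3, date(2001, 1, 1), date(8999, 12, 31)),
]

def flatten(rows):
    out = []
    # sorted() orders by group_id, then start_date, then end_date.
    for group_id, grp in groupby(sorted(rows), key=lambda r: r[0]):
        cur_start, m_time = None, None  # m_time: latest end date seen so far
        for _, start, end in grp:
            if m_time is None or start > m_time:  # gap: a new interval begins
                if cur_start is not None:
                    out.append((group_id, cur_start, m_time))
                cur_start, m_time = start, end
            else:                                  # overlap or touch: extend
                m_time = max(m_time, end)
        out.append((group_id, cur_start, m_time))
    return out

for row in flatten(inputs):
    print(row)
```

The strict comparison means that an interval starting exactly on the previous end date is fused with it, matching the interpretation of end dates described above.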