I have a database table with the following relevant columns:
group_id, value, valid_from, valid_to
None of those columns are unique (this is not the whole table).
Data could look like this:
group_id | valid_from | valid_to |
---|---|---|
1 | 20250201 | 20250230 |
2 | 20250201 | 20250230 |
2 | 20250201 | 20250230 |
3 | 20250201 | 20250215 |
3 | 20250213 | 20250220 |
3 | 20250221 | 20250230 |
4 | 20250101 | 20250220 |
4 | 20250210 | 20250215 |
4 | 20250219 | 20250230 |
5 | 20250101 | 20250115 |
5 | 20250513 | 20250619 |
So each group has one or more entries. Every entry has a date interval. The intervals can overlap and are not unique.
Now I have the impossible task to determine all groups which cover a date interval together.
Let's say I need to find all groups that cover the interval 20250201 - 20250230.
With the test data shown, I would expect to find groups 1, 2, 3 and 4.
(None of the rows of group 3 cover the interval by themselves. But they overlap and cover the interval together).
I have to solve this with SQL ONLY but I can use multiple selects.
I'm a SAP Dev and I try to solve this with a HANA AMDP Method / SQLScript but feel free to give me any SQL solution you can come up with.
My idea was to select all of the date intervals that overlap (not cover it completely) with my needed date interval. Then use window function LAG() to select the previous date intervals of the group and check if they overlap, or if there is a gap.
If they overlap, I would take the valid_from
of the previous row to widen/enhance the interval. This way I was hoping to create a entry that has the maximum covered interval of the group.
Then I could make one more select and look for entries that cover the whole interval.
tmp_1 =
SELECT
group_id,
CASE
WHEN valid_from <= ( LAG ( valid_to )
OVER ( PARTITION BY group_id
ORDER BY group_id, valid_from, valid_to ) + 1 )
THEN LAG ( valid_to )
OVER ( PARTITION BY group_id
ORDER BY group_id, valid_from, valid_to )
END as valid_from_min,
CASE
WHEN valid_to < LAG ( valid_to )
OVER ( PARTITION BY group_id
ORDER BY group_id, valid_from, valid_to )
THEN LAG ( valid_to )
OVER ( PARTITION BY group_id
ORDER BY group_id, valid_from, valid_to )
END as valid_to_max
FROM DB_GROUP
WHERE valid_from <= 20250230
AND valid_to >= 20250201;
tmp_2 =
SELECT
group_id
FROM tmp_1
WHERE valid_from_min <= 20250201
AND valid_to_max >= 20250230
GROUP BY group_id;
This only works when a group has less than 3 entries.
Second row valid_from_min
will equal valid_from
of the first row - good.
Third row valid_from_min
will equal valid_from
of the second row - bad.
LAG()
won't work if let's say interval B is included in interval A, e.g A: 0101-0131, B: 0115-0118, C: 0120-0228
: C's LAG()
is B, but there's a hole between both, while, had C looked for A, it would have seen they overlap.
What I generally use is slice each interval into subintervals (cut on each start or end of an overlap with another interval), then ensure that at least one interval covers each slice of your period of interest.
So for your group 3:
group_id | valid_from | valid_to | 1 → 12 | 13 → 15 | 16 → 20 | 21 → 30 | |
---|---|---|---|---|---|---|---|
3 | 20250201 | 20250215 | X | X | |||
3 | 20250213 | 20250220 | X | X | |||
3 | 20250221 | 20250230 | X | ||||
3 | 20250221 | 20250230 | 1 | 2 | 1 | 1 | ← count() over the group |
Small detail: we have to introduce our looked up interval's ends (0201 and 0230) into each group, so that in case a group's first interval starts on 0210 we count an empty slice from 0201 to 0209 (and get a row with a COUNT
of 0 for this slice) instead of having no slice to count (and get no row for this slice, thus forget to signal it).
To alleviate our algorithm, we introduce it as an interval of the group, so that it gets sliced like any other interval of DB_GROUP
.
Thus we finally won't look for a count of at least 1, but a count of at least 2 intervals (our reference one, and a true interval from DB_GROUP
) to overlap over each slice.
WITH
-- What period are we interested in?
fromto AS
(
SELECT
CAST('20250201' AS DATE) valid_from,
CAST('20250228' AS DATE) valid_to
FROM dummy
),
-- Groups over that period:
gr AS
(
SELECT
group_id,
GREATEST(i.valid_from, ft.valid_from) valid_from,
LEAST(i.valid_to, ft.valid_to) valid_to
FROM DB_GROUP i
-- Just keep the values that overlap our period of interest:
JOIN fromto ft ON i.valid_from <= ft.valid_to AND i.valid_to >= ft.valid_from
-- Have our period of interest get a fictive presence in each group:
UNION ALL
SELECT DISTINCT group_id, ft.valid_from, ft.valid_to FROM fromto ft, DB_GROUP
),
cuts AS
(
SELECT group_id, valid_from AS d FROM gr
UNION -- Not UNION ALL: we want only one row for identical dates.
SELECT group_id, adddays(valid_to, 1) FROM gr -- An interval ending on day D covers until the next interval starting on day D+1
),
slices AS
(
SELECT gr.group_id, cuts.d, COUNT(1) noverlaps
FROM cuts JOIN gr ON gr.group_id = cuts.group_id AND cuts.d >= valid_from AND cuts.d < valid_to
GROUP BY gr.group_id, cuts.d
)
SELECT group_id
FROM slices
GROUP BY group_id
HAVING MIN(noverlaps) > 1;
And we get the expected:
group_id |
---|
1 |
2 |
3 |
4 |
(with a PostgreSQL equivalent fiddle to see it run, add more corner cases, and select from the intermediate CTEs to get a glance into the steps)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With