I am trying to "flatten" a list of ranges in a defined order (alphabetically by name in the examples provided) to a single merged result. Newer Ranges overwrite values of older ranges. Conceptually it looks like this, with "e" being the newest range:
0 1 2 3 4 5 6 7
|-------------a-------------|
|---b---|
|---c---|
|---d---|
|---e---|
|-a-|---c---|---e---|-d-|-a-| <-- expected result
To prevent further confusion: The expected result here is indeed correct. The values 0 - 7 are just the ranges' values, not a progression in time. I use integers for simplicity here, but the values might not be discrete but continuous.
Note that b
is completely overshadowed and not relevant anymore.
the data may be modeled like this in SQL:
create table ranges (
name varchar(1),
range_start integer,
range_end integer
);
insert into ranges (name, range_start, range_end) values ('a', 0, 7);
insert into ranges (name, range_start, range_end) values ('b', 2, 4);
insert into ranges (name, range_start, range_end) values ('c', 1, 3);
insert into ranges (name, range_start, range_end) values ('d', 4, 6);
insert into ranges (name, range_start, range_end) values ('e', 3, 5);
-- assume alphabetical order by name
It would be perfect if there was a way to directly query the result in SQL, e.g. like this:
select *magic* from ranges;
-- result:
+------+-------------+-----------+
| a | 0 | 1 |
| c | 1 | 3 |
| e | 3 | 5 |
| d | 5 | 6 |
| a | 6 | 7 |
+------+-------------+-----------+
But I suspect that is not realistically feasible, therefore I need to at least filter out all ranges that are overshadowed by newer ones, as is the case for b
in the example above. Otherwise the query would need to transfer more and more irrelevant data as the database grows and new ranges overshadow older ones. For the example above, such a query could return all entries except for b
, e.g.:
select *magic* from ranges;
-- result:
+------+-------------+-----------+
| a | 0 | 7 |
| c | 1 | 3 |
| d | 4 | 6 |
| e | 3 | 5 |
+------+-------------+-----------+
I was unable to construct such a filter in SQL. The only thing I managed to do is query all data and then calculate the result in code, for example in Java using the Google Guava library:
final RangeMap<Integer, String> rangeMap = TreeRangeMap.create();
rangeMap.put(Range.closedOpen(0, 7), "a");
rangeMap.put(Range.closedOpen(2, 4), "b");
rangeMap.put(Range.closedOpen(1, 3), "c");
rangeMap.put(Range.closedOpen(4, 6), "d");
rangeMap.put(Range.closedOpen(3, 5), "e");
System.out.println(rangeMap);
// result: [[0..1)=a, [1..3)=c, [3..5)=e, [5..6)=d, [6..7)=a]
Or by hand in python:
import re
from collections import namedtuple
from typing import Optional, List
Range = namedtuple("Range", ["name", "start", "end"])
def overlap(lhs: Range, rhs: Range) -> Optional[Range]:
if lhs.end <= rhs.start or rhs.end <= lhs.start:
return None
return Range(None, min(lhs.start, rhs.start), max(lhs.end, rhs.end))
def range_from_str(str_repr: str) -> Range:
name = re.search(r"[a-z]+", str_repr).group(0)
start = str_repr.index("|") // 4
end = str_repr.rindex("|") // 4
return Range(name, start, end)
if __name__ == '__main__':
ranges: List[Range] = [
# 0 1 2 3 4 5 6 7
range_from_str("|-------------a-------------|"),
range_from_str(" |---b---| "),
range_from_str(" |---c---| "),
range_from_str(" |---d---| "),
range_from_str(" |---e---| "),
# result: |-a-|---c---|---e---|-d-|-a-|
]
result: List[Range] = []
for range in ranges:
for i, res in enumerate(result[:]):
o = overlap(range, res)
if o:
result.append(Range(res.name, o.start, range.start))
result.append(Range(res.name, range.end, o.end))
result[i] = Range(res.name, 0, 0)
result.append(range)
result = sorted(filter(lambda r: r.start < r.end, result), key=lambda r: r.start)
print(result)
# result: [Range(name='a', start=0, end=1), Range(name='c', start=1, end=3), Range(name='e', start=3, end=5), Range(name='d', start=5, end=6), Range(name='a', start=6, end=7)]
There is also another less effective but easier and shorter approach: to generate all points and just aggregate them.
For example this simple query will generate all intermediate points:
select x,max(name)
from ranges,
xmltable('xs:integer($A) to xs:integer($B)'
passing range_start as a
,range_end as b
columns x int path '.'
)
group by x
Results:
X M
---------- -
0 a
1 c
2 c
3 e
4 e
5 e
6 d
7 a
Then we can merge them:
select *
from (
select x,max(name) name
from ranges,
xmltable('xs:integer($A) to xs:integer($B)-1'
passing range_start as a
,range_end as b
columns x int path '.'
)
group by x
order by 1
)
match_recognize(
order by x
measures
first(x) as r_start,
last(x)+1 as r_end,
last(name) as r_name
pattern(STRT A*)
define
A as prev(name)=name and prev(x)+1 = x
);
Results:
R_START R_END R
---------- ---------- -
0 1 a
1 3 c
3 5 e
5 6 d
6 7 a
Here is a hierarchical query that would give you the desired output:
WITH ranges(NAME, range_start, range_end) AS
(SELECT 'a', 0, 7 FROM dual UNION ALL
SELECT 'b', 2, 4 FROM dual UNION ALL
SELECT 'c', 1, 3 FROM dual UNION ALL
SELECT 'd', 4, 6 FROM dual UNION ALL
SELECT 'e', 3, 5 FROM dual UNION ALL
SELECT 'f', -3, -2 FROM dual UNION ALL
SELECT 'g', 8, 20 FROM dual UNION ALL
SELECT 'h', 12, 14 FROM dual)
, rm (NAME, range_start, range_end) AS
(SELECT r.*
FROM (SELECT r.NAME
, r.range_start
, NVL(r2.range_start, r.range_end) range_end
FROM ranges r
OUTER apply (SELECT *
FROM ranges
WHERE range_start BETWEEN r.range_start AND r.range_end
AND NAME > r.NAME
ORDER BY range_start, NAME DESC
FETCH FIRST 1 ROWS ONLY) r2
ORDER BY r.range_start, r.NAME desc
FETCH FIRST 1 ROWS ONLY) r
UNION ALL
SELECT r2.NAME
, r2.range_start
, r2.range_end
FROM rm
CROSS apply (SELECT r.NAME
, GREATEST(rm.range_end, r.range_start) range_start
, NVL(r2.range_start, r.range_end) range_end
FROM ranges r
OUTER apply (SELECT *
FROM ranges
WHERE range_start BETWEEN GREATEST(rm.range_end, r.range_start) AND r.range_end
AND NAME > r.NAME
ORDER BY range_start, NAME DESC
FETCH FIRST 1 ROWS ONLY) r2
WHERE r.range_end > rm.range_end
AND NOT EXISTS (SELECT 1 FROM ranges r3
WHERE r3.range_end > rm.range_end
AND (GREATEST(rm.range_end, r3.range_start) < GREATEST(rm.range_end, r.range_start)
OR (GREATEST(rm.range_end, r3.range_start) = GREATEST(rm.range_end, r.range_start)
AND r3.NAME > r.NAME)))
FETCH FIRST 1 ROWS ONLY) r2)
CYCLE NAME, range_start, range_end SET cycle TO 1 DEFAULT 0
SELECT * FROM rm
First you get the first entry ordered by range_start desc, name which will give you the most resent entry with the lowest name. Then you search for a range with higher name that intersect with this range. If there is one the range_start of this interval will be the range_end of you final interval.
With this start you search more or less the next entry with the same condition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With