Algorithm to produce a difference of two collections of intervals

Question

Problem

Suppose I have two collections of intervals, named A and B. How would I find a difference (a relative complement) in a most time- and memory-efficient way?

Picture for illustration: enter image description here

Interval endpoints are integers ( ≤ 2¹²⁸-1 ) and they are always both 2ⁿ long and aligned on the m×2ⁿ lattice (so you can make a binary tree out of them).

Intervals can overlap in the input but this does not affect the output (the result if flattened would be the same).

The problem is because there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be slow.

The input is read from two files and it is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents in order of size. For example:

[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...

What have I tried?

So far, I've been working on a implementation that generates a binary search tree while doing so aggregates neighbouring intervals ( [0,3],[4,7] => [0,7] ) from both collections, then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).

While this appears to be working for small collections, it requires more and more RAM to hold the tree itself, not to mention the time it needs to complete the insertion and removal from the tree.

I figured that since intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure if this is possible, however.

So, how would I go about solving this problem in a efficient way?

Disclaimer: This is not a homework but a modification/generalization of an actual real-life problem I am facing. I am programming in C++ but I can accept an algorithm in any [imperative] language.

Arik G · Accepted Answer

Recall one of the first programming exercises we all had back in school - writing a calculator program. Taking an arithmetic expression from the input line, parsing it and evaluating. Remember keeping track of the parentheses depth? So here we go.

Analogy: interval start points are opening parentheses, end points - closing parentheses. We keep track of the parentheses depth (nesting). The depth of two - intersection of intervals, the depth of one - difference of intervals

Algorithm:

No need to distinguish between A and B, just sort all start points and end points in the ascending order
Set the parentheses depth counter to zero
Iterate through the end points, starting from the smallest one. If it is a starting point increment the depth counter, if it is an ending point decrement the counter
Keep track of intervals where the depth is 1, those are intervals of A and B difference. The intervals where the depth is two are AB intersections

Algorithm to produce a difference of two collections of intervals

Tags:

c++

language-agnostic

algorithm

binary-tree

Problem

What have I tried?

1 Answers

Arik G

Recent Activity

Donate For Us

Algorithm to produce a difference of two collections of intervals

Tags:

c++

language-agnostic

algorithm

binary-tree

Problem

What have I tried?

1 Answers

Arik G

Related questions

Recent Activity

Donate For Us