Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to produce a difference of two collections of intervals

Problem

Suppose I have two collections of intervals, named A and B. How would I find a difference (a relative complement) in a most time- and memory-efficient way?

Picture for illustration: enter image description here

Interval endpoints are integers ( ≤ 2128-1 ) and they are always both 2n long and aligned on the m×2n lattice (so you can make a binary tree out of them).

Intervals can overlap in the input but this does not affect the output (the result if flattened would be the same).

The problem is because there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be slow.

The input is read from two files and it is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents in order of size. For example:

[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...

What have I tried?

So far, I've been working on a implementation that generates a binary search tree while doing so aggregates neighbouring intervals ( [0,3],[4,7] => [0,7] ) from both collections, then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).

While this appears to be working for small collections, it requires more and more RAM to hold the tree itself, not to mention the time it needs to complete the insertion and removal from the tree.

I figured that since intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure if this is possible, however.


So, how would I go about solving this problem in a efficient way?

Disclaimer: This is not a homework but a modification/generalization of an actual real-life problem I am facing. I am programming in C++ but I can accept an algorithm in any [imperative] language.


1 Answers

Recall one of the first programming exercises we all had back in school - writing a calculator program. Taking an arithmetic expression from the input line, parsing it and evaluating. Remember keeping track of the parentheses depth? So here we go.

Analogy: interval start points are opening parentheses, end points - closing parentheses. We keep track of the parentheses depth (nesting). The depth of two - intersection of intervals, the depth of one - difference of intervals

Algorithm:

  • No need to distinguish between A and B, just sort all start points and end points in the ascending order
  • Set the parentheses depth counter to zero
  • Iterate through the end points, starting from the smallest one. If it is a starting point increment the depth counter, if it is an ending point decrement the counter
  • Keep track of intervals where the depth is 1, those are intervals of A and B difference. The intervals where the depth is two are AB intersections
like image 67
Arik G Avatar answered Sep 13 '25 20:09

Arik G