Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quickest way to determine range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).

$a_ranges =
{
  a_1 =>
  {
    start => ...,
    end   => ...,
  },
  a_2 =>
  {
    start => ...,
    end   => ...,
  },
  a_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

$b_ranges =
{
  b_1 =>
  {
    start => ...,
    end   => ...,
  },
  b_2 =>
  {
    start => ...,
    end   => ...,
  },
  b_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.

I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.

Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?

Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.

like image 665
Daniel Standage Avatar asked Dec 01 '22 03:12

Daniel Standage


2 Answers

This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.

I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.

Hope this helps!

like image 118
templatetypedef Avatar answered Dec 04 '22 06:12

templatetypedef


In case you are not exclusively tied to perl; The IRanges package in R deals with interval arithmetic. It has very powerful primitives, it would probably be easy to code a solution with them.

A second remark is that the problem could possibly become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach sifting through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point, and the other set by end point, then break out of the inner loop once a match is no longer possible. Of course, existing and general algorithms and data structures such as the interval tree mentioned earlier are the most powerful.

like image 45
micans Avatar answered Dec 04 '22 05:12

micans