Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

nth smallest element in a union of an array of intervals with repetition

I want to know if there is a more efficient solution than what I came up with(not coded it yet but described the gist of it at the bottom).

Write a function calcNthSmallest(n, intervals) which takes as input a non-negative int n, and a list of intervals [[a_1; b_1]; : : : ; [a_m; b_m]] and calculates the nth smallest number (0-indexed) when taking the union of all the intervals with repetition. For example, if the intervals were [1; 5]; [2; 4]; [7; 9], their union with repetition would be [1; 2; 2; 3; 3; 4; 4; 5; 7; 8; 9] (note 2; 3; 4 each appear twice since they're in both the intervals [1; 5] and [2; 4]). For this list of intervals, the 0th smallest number would be 1, and the 3rd and 4th smallest would both be 3. Your implementation should run quickly even when the a_i; b_i can be very large (like, one trillion), and there are several intervals

The way I thought to go about it is the straightforward solution which is to make the union array and traverse it.

like image 220
Don Quixote Avatar asked Nov 22 '12 07:11

Don Quixote


1 Answers

This problem can be solved in O(N log N) where N is the number of intervals in the list, regardless of the actual values of the interval endpoints.

The key to solving this problem efficiently is to transform the list of possibly-overlapping intervals into a list of intervals which are either disjoint or identical. In the given example, only the first interval needs to be split:

{       [1,5],        [2,4], [7,9]} =>
 +-----------------+  +---+  +---+
{[1,1], [2,4], [5,5], [2,4], [7,9]}

(This doesn't have to be done explicitly, though: see below.) Now, we can sort the new intervals, replacing duplicates with a count. From that, we can compute the number of values each (possibly-duplicated) interval represents. Now, we simply need to accumulate the values to figure out which interval the solution lies in:

interval  count  size    values     cumulative
                       in interval    values
  [1,1]     1      1        1         [0, 1)
  [2,4]     2      3        6         [1, 7)  (eg. from n=1 to n=6 will be here)
  [5,5]     1      1        1         [7, 8)
  [7,9]     1      3        3         [8, 11)

I wrote the cumulative values as a list of half-open intervals, but obviously we only need the end-points. We can then find which interval holds value n by, for example, binary-searching the cumulative values list, and we can figure out which value in the interval we want by subtracting the start of the interval from n and then integer-dividing by the count.

It should be clear that the maximum size of the above table is twice the number of original intervals, because every row must start and end at either the start or end of some interval in the original list. If we'd written the intervals as half-open instead of closed, this would be even clearer; in that case, we can assert that the precise size of the table will be the number of unique values in the collection of end-points. And from that insight, we can see that we don't really need the table at all; we just need the sorted list of end-points (although we need to know which endpoint each value represents). We can simply iterate through that list, maintaining the count of the number of active intervals, until we reach the value we're looking for.

Here's a quick python implementation. It could be improved.

def combineIntervals(intervals):
  # endpoints will map each endpoint to a count
  endpoints = {}
  # These two lists represent the start and (1+end) of each interval
  # Each start adds 1 to the count, and each limit subtracts 1
  for start in (i[0] for i in intervals):
    endpoints[start] = endpoints.setdefault(start, 0) + 1
  for limit in (i[1]+1 for i in intervals):
    endpoints[limit] = endpoints.setdefault(limit, 0) - 1
  # Filtering is a possibly premature optimization but it was easy
  return sorted(filter(lambda kv: kv[1] != 0,
                       endpoints.iteritems()))

def nthSmallestInIntervalList(n, intervals):
  limits = combineIntervals(intervals)
  cumulative = 0
  count = 0
  index = 0
  here = limits[0][0]
  while index < len(limits):
    size = limits[index][0] - here
    if n < cumulative + count * size:
      # [here, next) contains the value we're searching for
      return here + (n - cumulative) / count
    # advance
    cumulative += count * size
    count += limits[index][1]
    here += size
    index += 1
  # We didn't find it. We could throw an error

So, as I said, the running time of this algorithm is independent of the actual values of the intervals; it only depends in the length of the interval list. This particular solution is O(N log N) because of the cost of the sort (in combineIntervals); if we used a priority queue instead of a full sort, we could construct the heap in O(N) but making the scan O(log N) for each scanned endpoint. Unless N is really big and the expected value of the argument n is relatively small, this would be counter-productive. There might be other ways to reduce complexity, though.

like image 53
rici Avatar answered Sep 20 '22 12:09

rici