Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for handling closed intervals

I was learning for the test when I spotted problem I cannot deal with:

Design a data structure for handling (closed) intervals which will provide three operations:

Insert(x,y) - add interval [x, y]

Remove(x,y) - remove interval [x, y]

Count(x,y) - count the number of intervals that are purely contained within the interval [x, y]

x,y are integers from 1 to n. All operations can take at most time O((log n)2)

Can anyone help?

like image 650
xan Avatar asked Dec 11 '12 22:12

xan


2 Answers

This can be solved in O(log^2 n) time and O(nlog n) space using a combination of a Fenwick Tree and a data structure based on a segment tree, which houses an internal tree inside each node. The former is used for efficiently finding and updating counts of points within a given range; the latter for efficiently finding and updating counts of segments that straddle a given point. The basic idea is to count segment endpoints within the given query range, and then adjust for segments that cross either or both endpoints.

This algorithm is based on the observation that, for a given query range (a, b),

nContained = (h - nStraddlingOne) / 2

where nContained is the number of intervals contained by (a, b), h is the number of interval endpoints of either kind (beginning or ending) in (a, b), and nStraddlingOne is the number of intervals that contain either only a or only b. The "database" of intervals at any time is called D here.

A Fenwick tree allows you to count the total number of points between two integer indices (a, b) in O(log n) time, using an O(n) table. It also permits O(log n) insertions and deletions. We add both end points of every interval to it, and use it to calculate h in O(log n) time.

Counting crossings

Our other data structure is very similar to a segment tree, but there are two main differences: instead of inferring the start and end points of intervals from the input set of intervals, we take the set of endpoints to be every possible integer between 1 and n inclusive, and have no "open sets" (this is to simplify the addition of new intervals later); and we store each node's interval set in a particular way.

Suppose for simplicity that n is a power of 2 (if it isn't, just pick the next biggest power of 2 -- this will cause an increase of less than n, so the time complexity won't change). Let T be a complete binary tree that has a leaf node for each position 1 <= i <= n. (This tree will have 2n - 1 nodes in total.) Every leaf node represents a single position. Every non-leaf node represents the union of the positions of all leaf nodes under it, which must form an interval having length a power of 2. Call the interval represented by a node v Int(v). (Note: because this binary tree is complete, it can be represented "implicitly" the same way a binary heap often is, to save space.)

To each node v of T, there corresponds a set of intervals called Span(v). (We will only actually store their rightmost endpoints.) Span(v) is the set of all intervals in D that

  1. Contain Int(v), and
  2. Do not contain Int(parent(v)).

In each vertex v, we store only the rightmost endpoints of the intervals in Span(v) in a self-balancing binary tree that is ordered by this end point, and in which every node is augmented with a count of descendant nodes. That is, every node of the "outer" tree contains an "inner" tree. This is necessary to enable us to efficiently calculate how many intervals completely contain a given query interval.

The fundamental query operation on a segment tree is determining the set of intervals that contains a given point x, and this is easily changed to count rather than report individual intervals. The operation is simple: setting v to the root,

  1. Add the number of intervals in Span(v) to the total.
  2. If v is a leaf, then stop.
  3. Otherwise, suppose the left and right children of v are u and w respectively. If x is contained in Int(u), then recurse on u, otherwise recurse on w.

Given a query interval (a, b), the above query can be performed twice, once for a and once for b. Set nStraddlingAtLeastOne to the sum of the counts for both queries. Note that counting can be done in O(1) time for each node -- it is just the count field of the root node of the self-balancing binary tree storing Span(v).

Double crossings!

The difficulty (and what eventually stymied an earlier attempt at a O(log n) algorithm that I spent some time on) is that we also need to count the number of intervals that simultaneously span both endpoints of the query (a, b). To do this, we start with v at the root again, and perform a modified query for a (the starting point of the query), in which step 1 is replaced with:

  1. Count the number of intervals in Span(v) having right endpoint >= b, and add it to the total.

This counting step can be performed in O(log n) time due to the self-balancing binary tree. Since each tree level needs up to 2 of these counting operations, this is the part that pushes the time complexity up to O(log^2 n). Set nContaining to the total from this query. We can calculate nStraddlingOne using

nStraddlingOne = nStraddlingAtLeastOne - nContaining

Using nStraddlingOne and the h calculated from the Fenwick tree, we can now calculate nContained according to the equation at the top.

Adding and removing intervals

Updating is O(log^2 n) too, since updating the Fenwick tree is O(log n), and adding an interval (x, y) to the segment tree takes O(log^2 n) time using the following algorithm, starting with v at the root:

  1. If (x, y) contains Int(v), then add y to the self-balancing binary tree storing the right endpoints of Span(v) and stop.
  2. Otherwise, suppose the left and right children of v are u and w respectively.
    • If (x, y) overlaps Int(u), recurse on u.
    • If (x, y) overlaps Int(w), recurse on w.

The above traversal visits only O(log n) nodes in the segment tree, because each interval added will appear in the interval sets of at most 2 nodes on any tree level, for a maximum of 2log(n) space usage per interval. Please see the Wikipedia page on Segment Trees for a proof and further explanation. Removing an interval uses a similar algorithm. Each insertion or removal into the self-balancing binary tree at a node takes O(log n) time, for O(log^2 n) time in total.

Space usage

Space usage is O(nlog n), since there are O(log n) tree levels in the segment tree, each of which may require space for 2 instances of an internal tree node containing each possible endpoint. Note in particular that even though there are O(n^2) possible distinct intervals, we store only the rightmost endpoint of each interval, so this is not a problem. Also because we store counts, adding a second copy of an existing interval only causes counts to increase -- it does not cause any new node allocations. Fenwick trees use only O(n) space.

like image 123
j_random_hacker Avatar answered Oct 10 '22 07:10

j_random_hacker


See Range Tree. The time complexity for querying is mentioned as O(logd n + k). Here d is dimension, n is number of points stored in the tree and k is number of points reported by the query.

But we need only the count without reporting the points. So I think if number of children (actually number of leaves because the real points are stored in the leaves) is maintained on each node, this k can be eliminated leaving us with O(log2 n). Insertion and deletion are also O(log2 n).

like image 31
user1168577 Avatar answered Oct 10 '22 08:10

user1168577