Finding the minimum value of the maximum cluster?

Question

Define an item as having:

a unique id
a value
a creation time
a deletion time

I have two input streams - one that informs me when an item is created, one that informs me when the item is deleted. Call an item that has been created but not destroyed "living."

I can track the maximum value of all living items using a heap:

whenCreated(item):
  i = heap.size
  heap-up(item, heap.size)
  heap.size = heap.size + 1
  max-value = heap[0]

whenDeleted(item):
  ktem = heap[heap.size - 1]
  heap.size = heap.size - 1
  heap-up(ktem, index[item.id])
  heap-down(ktem, index[ktem.id])
  max-value = heap[0]

heap-up(item, i):
  while (i > 0):
    j = floor( (i-1) / 2 )
    jtem = heap[j]
    if (jtem.value > item.value):
      break while
    index[jtem.id] = i 
    heap[i] = heap[i]
    i = j
  index[item.id] = i
  heap[i] = item

heap-down(item, i):
  while (2*i + 1 < heap.size):
    if (2*i + 1 == heap.size or heap[2*i+1].value > heap[2*i+2].value):
      j = 2*i + 1
    else
      j = 2*i + 2          
    jtem = heap[j]
    if (jtem.value < item.value):
      break while
    index[jtem.id] = i
    heap[i] = heap[i] 
    i = j         
  index[item.id] = i
  heap[i] = item

If I've got n items, then adding or deleting one takes O(log n) time.

Now suppose the items are clustered such that given two items, a and b, |a.value - b.value| < delta ⇒ a and b are in the same cluster.

For example, if we've got the values (1, 2, 3, 4, 7, 8, 11, 13, 14, 15, 16) and delta = 2, then the clusters are (1, 2, 3, 4), (7, 8), (11), and (13, 14, 15, 16).

I'd like to track the minimum value of the cluster that contains the maximum living value. I can do this by reading values off the heap in order until I find a gap between values of size greater than equal to delta. However, this takes O(n) time, which seems rather inefficent.

Is there an O(log n) algorithm to track the minimum value of that cluster?

templatetypedef · Accepted Answer

I believe that you can do this using a balanced binary search tree of splay trees to guarantee O(log n) amortized time for each operation.

Suppose that we weren't doing any clustering. In that case, you could just store all the elements in a balanced binary search tree to get O(log n) insertion, deletion, and find-min. But with clustering, this changes. My suggestion is to maintain a BST of the clusters sorted by the range of values held in the cluster, where each cluster is represented as a splay tree of the nodes it contains.

To insert an element into the data structure, do a search in the BST for the predecessor and successor of the element in question. If the node belongs to neither of these clusters, create a singleton splay tree out of that node and insert it into the BST. If it's contained in exactly one of the two clusters you found, insert it into that cluster. If it's contained in both clusters, then remove both clusters from the BST, merge them together into one cluster, insert the new node into that cluster, then reinsert the cluster into the BST. The lookup time in all cases in O(log n) to find the two clusters, then O(log n) amortized time to insert into a cluster. Merging two splay trees is actually fast here; since he clusters were not merged before, one tree holds values that are all strictly greater than all the values in the other tree, so the merge can be done in O(log n) amortized time by retiring pointers. The cost of removing the two trees and adding them back in is also O(log n).

To find the minimum value of the maximum cluster, find the maximum cluster in the BST in O(log n) time, then find the minimum element of the cluster you find in amortized O(log n) time.

To delete an element, find the cluster that contains it in O(log n) time. If it is in its own cluster, delete that cluster Fromm the tree. If not, remove the element from the cluster it's in, then find its predecessor and successor within that cluster. If they are within delta of one another, then the cluster is still fine and we are done. Otherwise, the cluster must be split. In O(log n) amortized time, split the cluster into the cluster of nodes less than or equal to the predecessor and greater or equal to the successor, then reinsert both clusters into the tree.

Overall, this gives O(log n) amortized for each operation.

Hope this helps!

Finding the minimum value of the maximum cluster?

Tags:

language-agnostic

algorithm

stream

data-structures

cluster-analysis

rampion

1 Answers

templatetypedef

Recent Activity

Donate For Us

Finding the minimum value of the maximum cluster?

Tags:

language-agnostic

algorithm

stream

data-structures

cluster-analysis

rampion

1 Answers

templatetypedef

Related questions

Recent Activity

Donate For Us