Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Store the largest 5000 numbers from a stream of numbers

Tags:

Given the following problem:

"Store the largest 5000 numbers from a stream of numbers"

The solution which springs to mind is a binary search tree maintaining a count of the number of nodes in the tree and a reference to the smallest node once the count reaches 5000. When the count reaches 5000, each new number to add can be compared to the smallest item in the tree. If greater, the new number can be added then the smallest removed and the new smallest calculated (which should be very simple already having the previous smallest).

My concern with this solution is that the binary tree is naturally going to get skewed (as I'm only deleting on one side).

Is there a way to solve this problem which won't create a terribly skewed tree?

In case anyone wants it, I've included pseudo-code for my solution so far below:

process(number)
{
  if (count == 5000 && number > smallest.Value)
  {
    addNode( root, number)
    smallest = deleteNodeAndGetNewSmallest ( root, smallest)
  }
}

deleteNodeAndGetNewSmallest( lastSmallest)
{
  if ( lastSmallest has parent)
  {
    if ( lastSmallest has right child)
    {
      smallest = getMin(lastSmallest.right)
      lastSmallest.parent.right = lastSmallest.right
    }
    else
    {
      smallest = lastSmallest.parent
    }
  }
  else 
  {
    smallest = getMin(lastSmallest.right)
    root = lastSmallest.right
  }
  count--
  return smallest
}

getMin( node)
{
  if (node has left)
    return getMin(node.left)
  else
    return node
}

add(number)
{
  //standard implementation of add for BST
  count++
}
like image 717
Rich Avatar asked May 25 '12 09:05

Rich


1 Answers

The simplest solution for this is maintaining a min heap of max size 5000.

  • Every time a new number arrives - check if the heap is smaller then 5000, if it is - add it.
  • If it is not - check if the minimum is smaller then the new element, and if it is, pop it out and insert the new element instead.
  • When you are done - you have a heap containing 5000 largest elements.

This solution is O(nlogk) complexity, where n is the number of elements and k is the number of elements you need (5000 in your case).

It can be done also in O(n) using selection algorithm - store all the elements, and then find the 5001th largest element, and return everything bigger than it. But it is harder to implement and for reasonable size input - might not be better. Also, if stream contains duplicates, more processing is needed.

like image 51
amit Avatar answered Sep 19 '22 22:09

amit