 

Find median value from a growing set

Tags:

algorithm

I came across an interesting algorithm question in an interview. I gave my answer, but I am not sure whether there is a better idea, so I welcome everyone's thoughts.

You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).

Every time a new element is added to the set, the set's median value is asked for. The median is defined as in math: the middle element of the sorted list. As a special case, when the size of the set is even, say 2*x, the median is the x-th element of the sorted set.

An example: Start with an empty set, when 12 is added, the median is 12, when 7 is added, the median is 7, when 8 is added, the median is 8, when 11 is added, the median is 8, when 5 is added, the median is 8, when 16 is added, the median is 8, ...

Notice that, first, elements are added to the set one by one and, second, we don't know in advance which elements are going to be added.

My answer.

Since it is a question about finding the median, sorting is needed. The easiest solution is to use a normal array and keep it sorted. When a new element comes, use binary search to find its position (O(log n)) and insert it into the array. Since it is a normal array, the rest of the array has to be shifted, which takes O(n) time. Once the element is inserted, we can read the median immediately, in constant time.

The WORST time complexity per insertion is: O(log n) + O(n) + O(1) = O(n).
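The sorted-array idea can be sketched in a few lines of Python. This is only an illustration, assuming the standard `bisect` module; the function name `running_medians_sorted_list` is made up for this example.

```python
import bisect


def running_medians_sorted_list(items):
    """Yield the median after each insertion, keeping a sorted list."""
    ordered = []
    for value in items:
        # Binary search for the position is O(log n); the insertion
        # itself shifts the tail of the list, which is O(n).
        bisect.insort(ordered, value)
        # For size 2*x the question asks for the x-th element (1-based),
        # i.e. index x - 1; for size 2*x + 1 the middle is index x.
        # Both cases reduce to (size - 1) // 2.
        yield ordered[(len(ordered) - 1) // 2]


print(list(running_medians_sorted_list([12, 7, 8, 11, 5, 16])))
```

Run on the example sequence from the question, this should reproduce the medians listed there.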

Another solution is to use a linked list. The reason for using a linked list is to remove the need to shift the array. But finding the location for the new element requires a linear search. Adding the element takes constant time, and then we need to find the median by walking through half of the list, which always takes n/2 steps.

The WORST time complexity per insertion is: O(n) + O(1) + O(n/2) = O(n).

The third solution is to use a binary search tree. With a tree, we avoid shifting an array. But finding the median in a plain binary search tree is not very attractive, so I change the tree so that the left and right subtrees of the root are always balanced in size. This means that at any time, either the left subtree and the right subtree have the same number of nodes, or the right subtree has one node more than the left subtree. In other words, the root element is always the median. Of course this requires changes in the way the tree is built. The technical detail is similar to the rotations in a red-black tree.

If the tree is maintained properly, the WORST time complexity is O(n).

So the three algorithms are all linear in the size of the set. If no sub-linear algorithm exists, the three algorithms can be considered optimal. Since they don't differ from each other much, the best is the easiest to implement, which is the second one, using a linked list.

So what I really wonder is whether there is a sub-linear algorithm for this problem and, if so, what it would look like. Any ideas, guys?

Steve.

Steve asked Sep 07 '09 03:09




1 Answer

Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the ith in the stream is the median of the first i items) efficiently.

I believe this can be done in O(n*lg n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for the items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).
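A minimal sketch of the two-heap idea in Python follows. It assumes the standard `heapq` module (a binary min-heap, so the max-heap is emulated by negating values); the class name `RunningMedian` and its method `add` are made up for this example. The lower heap is allowed to hold one extra element so that its top is the x-th element when the size is 2*x, matching the question's definition.

```python
import heapq


class RunningMedian:
    """Two-heap running median (hypothetical helper for illustration)."""

    def __init__(self):
        self._low = []   # max-heap (values negated): lower half, median on top
        self._high = []  # min-heap: upper half, smallest on top

    def add(self, value):
        """Insert value in O(log n) and return the current median."""
        if self._low and value > -self._low[0]:
            heapq.heappush(self._high, value)
        else:
            heapq.heappush(self._low, -value)
        # Rebalance so that len(low) == len(high) or len(high) + 1.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))
        return -self._low[0]


tracker = RunningMedian()
print([tracker.add(v) for v in [12, 7, 8, 11, 5, 16]])
```

Each insertion does a constant number of heap pushes and pops, so n insertions cost O(n*lg n) in total with binary heaps.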

This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)

If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
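One way this could look, as a sketch only: if every value is known to lie in `range(universe_size)` (an assumption not stated in the question), a Fenwick tree of counts gives O(lg U) per insertion, where U is the number of possible values, independent of n. The class name `BoundedRunningMedian` is made up for this example.

```python
class BoundedRunningMedian:
    """Running median for integers in range(universe_size).

    Uses a Fenwick (binary indexed) tree of per-value counts.
    Assumes the value range is known up front.
    """

    def __init__(self, universe_size):
        self._size = universe_size
        self._tree = [0] * (universe_size + 1)  # 1-based Fenwick array
        self._count = 0
        # Largest power of two not exceeding universe_size.
        self._top = 1 << (universe_size.bit_length() - 1) if universe_size else 0

    def add(self, value):
        """Record value and return the current median."""
        i = value + 1
        while i <= self._size:
            self._tree[i] += 1
            i += i & -i
        self._count += 1
        # Size 2*x -> rank x, size 2*x + 1 -> rank x + 1.
        return self._kth((self._count + 1) // 2)

    def _kth(self, k):
        """Return the k-th smallest recorded value (k is 1-based)."""
        pos = 0
        step = self._top
        while step:
            nxt = pos + step
            if nxt <= self._size and self._tree[nxt] < k:
                pos = nxt
                k -= self._tree[nxt]
            step >>= 1
        # pos is the last 1-based index whose prefix count is below k,
        # so the answer is index pos + 1, i.e. value pos.
        return pos


tracker = BoundedRunningMedian(17)  # values 0..16
print([tracker.add(v) for v in [12, 7, 8, 11, 5, 16]])
```

Because it stores counts rather than elements, this variant also tolerates repeated values, which is the situation where n can exceed the number of possible values.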

Jonathan Graehl answered Sep 28 '22 08:09