
Incremental median computation with max memory efficiency

I have a process that generates values, which I observe as they are produced. When the process terminates, I want to compute the median of those values.

If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?

Edit: I'm interested in two cases: 1) the stream length is known in advance, 2) it is not.

asked Jul 30 '10 by Mau

2 Answers

You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could turn out to be the median. It is probably simplest to just store all the points and find the median at the end. If saving ceil(n/2) points is worth the trouble, read the first ceil(n/2) points into a sorted structure (a binary tree is probably best); then, as each new point arrives, throw out either the low or the high point and keep track of how many points have been thrown out at each end, so you still know where the median sits among the stored points. A sketch of this idea follows.
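Here is a minimal Python sketch of one way to get that memory bound when n is known up front (the function name and the heap-based bookkeeping are my choice, not necessarily the exact scheme described above): keep the floor(n/2) + 1 smallest values seen so far in a max-heap; any value larger than the current maximum of that set can never be the median, so it is dropped immediately.

    import heapq

    def median_known_length(values, n):
        # Keeps the n//2 + 1 smallest values seen so far in a max-heap
        # (heapq is a min-heap, so values are stored negated).
        k = n // 2 + 1              # candidates that must be retained
        heap = []                   # negated values -> behaves as a max-heap
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:      # smaller than the current k-th smallest
                heapq.heapreplace(heap, -v)
        upper = -heap[0]            # value of rank n//2 + 1
        if n % 2 == 1:
            return upper            # odd n: this is the median
        heapq.heappop(heap)         # drop rank n//2 + 1 ...
        lower = -heap[0]            # ... exposing rank n//2
        return (lower + upper) / 2

For example, median_known_length(iter([5, 1, 3]), 3) returns 3 while holding only two values at any time.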

Edit:

If the stream length is unknown then, as Stephen observed in the comments, we have no choice but to remember everything. If duplicate items are likely, we may be able to save some memory using Dolphin's idea of storing values together with their counts.
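As a rough sketch of that value/count idea (assuming repeats are common, so the number of distinct values k stays small; the function name is arbitrary), one could tally the stream into a dictionary and walk it in sorted order to locate the median rank, using O(k) memory rather than O(n):

    from collections import Counter

    def median_from_counts(values):
        # O(k) memory for k distinct values: store each value with its count.
        counts = Counter()
        n = 0
        for v in values:
            counts[v] += 1
            n += 1
        # Walk the distinct values in sorted order until the median rank(s) are reached.
        lower_rank, upper_rank = (n + 1) // 2, n // 2 + 1   # equal when n is odd
        seen = 0
        lower = upper = None
        for v in sorted(counts):
            seen += counts[v]
            if lower is None and seen >= lower_rank:
                lower = v
            if seen >= upper_rank:
                upper = v
                break
        return (lower + upper) / 2

The final sort touches only the distinct values, so this degrades gracefully toward O(n) memory when every value is unique.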

answered Oct 10 '22 by deinst

You can

  • Use statistics, if that's acceptable - for example, you could use sampling (see the sketch after this list).
  • Use knowledge about your number stream:
    • use a counting-sort-like approach: k distinct values means storing only O(k) memory;
    • or toss out known outliers and keep a (high, low) counter.
    • If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
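For the sampling option, a reservoir sample gives a fixed-memory approximate median; this is only an illustration (the sample size k and the function name are my own choices), and it returns an estimate, not the exact median. It also works when the stream length is unknown.

    import random

    def approximate_median(values, k=1000):
        # Standard reservoir sampling: after processing i+1 values, each of
        # them has probability k/(i+1) of being in the sample.
        sample = []
        for i, v in enumerate(values):
            if i < k:
                sample.append(v)
            else:
                j = random.randrange(i + 1)
                if j < k:
                    sample[j] = v
        sample.sort()
        m = len(sample)
        mid = m // 2
        return sample[mid] if m % 2 else (sample[mid - 1] + sample[mid]) / 2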
answered Oct 10 '22 by Stephen