Calculate the median of a billion numbers

If you have one billion numbers and one hundred computers, what is the best way to locate the median of these numbers?

One solution which I have is:

  • Split the set equally among the computers.
  • Sort each computer's share.
  • Find the median of each sorted set.
  • Order the sets by their medians.
  • Merge two sets at a time, from the lowest median to the highest.

If we have m1 < m2 < m3 < ... then first merge Set1 and Set2; in the resulting set we can discard all the numbers lower than the median of Set12 (the merged set). So at any point in time we have equal-sized sets. By the way, this cannot be done in a parallel manner. Any ideas?
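To make the merge-and-discard step concrete, here is a rough single-process Python simulation (sizes shrunk to 10,000 values across 10 sets; taking the smallest survivor at the end is my guess at how to read the answer off, since that part is exactly what I am unsure about):

import random

def merge_and_discard(a, b):
    # Merge two equal-sized sorted sets, then discard everything below
    # the median of the merged set, so the result is the same size as
    # each input.
    merged = sorted(a + b)
    return merged[len(merged) // 2:]

random.seed(0)
data = [random.random() for _ in range(10_000)]

# Steps 1-3: split equally, sort each share, note each share's median.
parts = [sorted(data[i::10]) for i in range(10)]
# Step 4: order the sets by their medians.
parts.sort(key=lambda p: p[len(p) // 2])

# Step 5: merge pairwise from the lowest median upwards.
survivors = parts[0]
for p in parts[1:]:
    survivors = merge_and_discard(survivors, p)

print("smallest survivor:", survivors[0])
print("true median:      ", sorted(data)[len(data) // 2])

Comparing the smallest survivor with the brute-force median shows how close (or not) this procedure gets.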

asked by anony


2 Answers

Ah, my brain has just kicked into gear; I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.

Each of the other machines sorts its data, and does so in a way which favours finding the lower values first: for example, a quicksort that always sorts the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle's algorithm on: experiment a bit).
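A minimal sketch of that "lowest values first" sort, as a lazy Python generator standing in for the real per-machine sort (note it uses O(N) extra space, unlike the in-place variant the footnote talks about):

import random

def incremental_sort(xs):
    # Yield the elements of xs in increasing order, lazily: the lower
    # partition is always sorted (and yielded) first, so the smallest
    # values are available long before the whole array is sorted.
    if len(xs) <= 1:
        yield from xs
        return
    pivot = xs[random.randrange(len(xs))]
    yield from incremental_sort([x for x in xs if x < pivot])
    yield from (x for x in xs if x == pivot)
    yield from incremental_sort([x for x in xs if x > pivot])

Each machine would then push values off the front of this generator onto the wire as fast as the control machine will take them.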

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, keeping only a count of the number of values it has seen. It calculates the median as the mean of the 500,000,000th and 500,000,001st values.
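In Python terms the counting merge could look something like this (a sketch: streams stands for the 99 sorted streams arriving from the workers, however they are actually delivered, and n is assumed even):

import heapq

def counting_merge_median(streams, n):
    # 99-way merge that discards values, keeping only a count, until
    # the two middle values of the n total values have gone past.
    target = n // 2 - 1          # 0-based index of the lower middle value
    lower_mid = None
    for seen, value in enumerate(heapq.merge(*streams)):
        if seen == target:
            lower_mid = value
        elif seen == target + 1:
            return (lower_mid + value) / 2

For example, counting_merge_median([iter(part) for part in sorted_parts], 10**9) for the billion-value case.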

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bottleneck, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1,2,3,.. 100", a sorting machine could send a single message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.
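As a sketch of how that modified merge might go (a single-process simulation where each sorted list stands in for a sorting machine, and BLOCK is an invented message size):

import bisect

BLOCK = 1000  # how many values one "k values less than X" message covers

def rank_value(workers, rank):
    # Find the value with the given 0-based rank across the sorted
    # 'workers' lists, exchanging (count, bound) summaries instead of
    # raw values. Cursors track how much of each worker is counted.
    cursors = [0] * len(workers)
    counted = 0
    while True:
        # Each unfinished worker proposes the bound at the end of its
        # next block; the control takes the least of those bounds.
        bound = min(w[min(c + BLOCK, len(w)) - 1]
                    for w, c in zip(workers, cursors) if c < len(w))
        # Every worker answers: how many of my values are <= bound?
        below = [bisect.bisect_right(w, bound) for w in workers]
        total = sum(below)
        if total > rank:
            # The target is <= bound: merge just the uncounted values
            # up to the bound and index into them directly.
            candidates = sorted(x for w, c in zip(workers, cursors)
                                for x in w[c:bisect.bisect_right(w, bound)])
            return candidates[rank - counted]
        counted, cursors = total, below  # resume past the counted prefix

Calling rank_value for ranks n//2 - 1 and n//2 and averaging gives the median; the point is that most values cross the network only as a count, never as themselves.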

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.
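One such game (my guess at what it might look like, not something I have worked out properly): binary-search the value range itself, where each probe costs one round trip in which every machine answers a single count query. For integer data in a known range [lo, hi]:

import bisect

def rank_value_by_probing(workers, rank, lo, hi):
    # Binary-search the value space for the 0-based rank-th value.
    # Each probe asks every machine one question: "how many of your
    # values are <= mid?" -- about 30 round trips for 32-bit data.
    while lo < hi:
        mid = (lo + hi) // 2
        if sum(bisect.bisect_right(w, mid) for w in workers) > rank:
            hi = mid
        else:
            lo = mid + 1
    return lo

No sorted streaming is needed at all in this version, which trades the big one-way data flow for a few dozen tiny round trips.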

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.

answered by Steve Jessop


# Sort numerically, keep only the two middle lines (the 500,000,000th
# and 500,000,001st values), then have dc average them: "1 k" sets one
# decimal place of precision, each "?" reads one value from stdin.
sort -g numbers | head -n 500000001 | tail -n 2 | dc -e "1 k ? ? + 2 / p"
answered by DrPizza