Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Median of median algorithms understanding

I've searched around the web and visited the wiki page for the Median of median algorithm. But can't seem to find an explicit statement to my question:

If one has a very very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sub lists of varying sizes (or equal doesn't really matter), then proceed to compute the medians of those smaller sub-lists, then compute the median of those medians result in the median of the original large list?

Furthermore is this statement also correct for any of the kth statistics? I'd be interested in links to research etc in this area.

like image 470
Jared Krumsie Avatar asked Dec 02 '22 01:12

Jared Krumsie


2 Answers

The answer to your question is no.

If you want to understand how to actually select the k-th order statistics (including the median of course) in a parallel setting (distributed setting is of course not really different), take a look at this recent paper, in which I proposed a new algorithm improving the previous state of the art algorithm for parallel selection:

Deterministic parallel selection algorithms on coarse-grained multicomputers

Here, we use two weighted 3-medians as pivots, and partition around these pivots using five-way partitioning. We also implemented and tested the algorithm using MPI. Results are very good, taking into account that this is a deterministic algorithm exploiting the worst-case O(n) selection algorithm. Using the randomized O(n) QuickSelect algorithm provides an extremely fast parallel algorithm.

like image 136
Massimo Cafaro Avatar answered Jan 22 '23 09:01

Massimo Cafaro


If one has a very very large list of integers (TBs in size) and wants to find the median of this list in a distributed manner, would breaking the list up into sub lists of varying sizes (or equal doesn't really matter), then proceed to compute the medians of those smaller sub-lists, then compute the median of those medians result in the median of the original large list?

No. The actual median of the entire list is not necessarily a median of any of the sublists.

Median-of-medians can give you a good choice of pivot for quickselect by virtue of being nearer the actual median than a randomly selected element, but you would have to do the rest of the quickselect algorithm to locate the actual median of the larger list.

like image 43
Don Roby Avatar answered Jan 22 '23 11:01

Don Roby