I was asked this question in an interview recently. There are N numbers, too many to fit into memory. They are split across k database tables (unsorted), each of which can fit into memory. Find the median of all the numbers. Wasn't quite sure about the answer to this one.


Finding median of large set of numbers too big to fit into memory

2 Answers

There are a few potential solutions:

External merge sort - O(n log n)
Sort the numbers in a first pass, then read the median off the sorted output in a second pass.
Order statistics distributed selection algorithm - O(n)
Reduce the problem to the classic one of finding the kth number in an unsorted array.
Counting sort histogram - O(n)
This requires some assumption about the range of the numbers: can a counter for every value in the range fit in memory?
If anything is known about the distribution of the numbers, other algorithms become possible.
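As a rough illustration of the counting sort histogram option, here is a minimal sketch. It assumes the values are integers in a known range [lo, hi] that is small enough for one counter per value to fit in memory; the function name `median_by_histogram` and the use of plain Python lists to stand in for the database tables are hypothetical choices for this example.

```python
from typing import Iterable, List


def median_by_histogram(tables: Iterable[List[int]], lo: int, hi: int) -> float:
    """Median of integers in [lo, hi] spread over several tables.

    Assumption: hi - lo + 1 counters fit in memory, even though the data does not.
    """
    counts = [0] * (hi - lo + 1)   # one counter per possible value
    total = 0
    for table in tables:           # only one table is processed at a time
        for x in table:
            counts[x - lo] += 1
            total += 1
    if total == 0:
        raise ValueError("no data")

    def kth_smallest(k: int) -> int:
        """Value at 0-based rank k, found by walking the cumulative counts."""
        seen = 0
        for offset, c in enumerate(counts):
            seen += c
            if seen > k:
                return lo + offset
        raise IndexError(k)

    if total % 2 == 1:
        return float(kth_smallest(total // 2))
    # Even count: average the two middle values.
    return (kth_smallest(total // 2 - 1) + kth_smallest(total // 2)) / 2
```

Calling `median_by_histogram([[5, 1, 9], [3, 7], [2, 8, 4]], 0, 10)` averages the two middle values, 4 and 5. If the range is too wide for one counter per value, the same idea can be applied with coarser buckets followed by a second pass over the bucket that contains the median.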

For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html


answered Sep 28 '22 21:09

user1712376

This answer on Quora explains the whole process clearly, step by step: http://qr.ae/dMkGc. I'm copying it here for non-Quorans.

Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers). The master first queries the servers for the sizes of their data sets and adds them up; call the total n. It then knows to look for the kth largest element, with k = n/2.

The master then selects a random server and queries it for a random element from the elements on that server. The master broadcasts this element (the pivot) to each server, and each server partitions its elements into those larger than or equal to the pivot and those smaller than the pivot.

Each server returns to the master the size of its larger-than-or-equal partition; call the sum of these sizes m. If m is greater than k, the master tells each server to disregard its less-than set for the remainder of the algorithm. If m is less than k, the master tells each server to disregard its larger-than-or-equal set and updates k = k - m. If m is exactly k, the algorithm terminates and returns the pivot selected at the beginning of the iteration.

If the algorithm does not terminate, recurse beginning with selecting a new random pivot from the remaining elements.
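The steps above can be sketched in a single process, with each server simulated as a Python list. This is only an illustration under assumptions: the name `kth_largest` and the list-based "servers" are hypothetical, and the sketch deviates from the description in one respect by counting elements equal to the pivot separately, so that repeated values cannot stall the loop.

```python
import random
from typing import List, Optional


def kth_largest(servers: List[List[int]], k: int,
                rng: Optional[random.Random] = None) -> int:
    """Return the k-th largest value (1-based) across all simulated servers."""
    rng = rng or random.Random()
    live = [list(s) for s in servers]        # each server's remaining candidates
    n = sum(len(s) for s in live)            # master asks every server for its size
    if not 1 <= k <= n:
        raise ValueError("k out of range")
    while True:
        # Master picks a random non-empty server, which returns a random element.
        pivot = rng.choice(rng.choice([s for s in live if s]))
        # Each server reports how many of its elements are > pivot and == pivot;
        # the master adds the reports together.
        greater = sum(sum(1 for x in s if x > pivot) for s in live)
        equal = sum(sum(1 for x in s if x == pivot) for s in live)
        if k <= greater:
            # The answer is strictly above the pivot: keep only larger elements.
            live = [[x for x in s if x > pivot] for s in live]
        elif k <= greater + equal:
            return pivot                     # the pivot itself sits at rank k
        else:
            # The answer is strictly below the pivot: drop the rest and shift k.
            live = [[x for x in s if x < pivot] for s in live]
            k -= greater + equal
```

For example, `kth_largest([[5, 1, 9], [3, 7], [2, 8, 4]], 4)` looks for the 4th largest of eight values. In a real deployment only the pivot and the counts would cross the network; the list comprehensions here stand in for work done locally on each server.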

Analysis:

Let n be the total number of elements and s be the number of servers. Assume that the elements are roughly randomly and evenly distributed among servers (each server has O(n/s) elements). In iteration i, we expect to do about O(n/(s*2^i)) work on each server, as the size of each server's element set will be cut approximately in half (remember, we assumed a roughly random distribution of elements), and O(s) work on the master (for broadcasting and receiving messages and adding the sizes together). We expect O(log(n/s)) iterations. Adding these up over all iterations gives an expected runtime of O(n/s + s log(n/s)). Assuming s << sqrt(n), which is normally the case, this becomes simply O(n/s), which is the best you could possibly hope for.
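Written out, the sum behind that bound looks roughly like this, where T stands for the expected number of iterations and constant factors are ignored as in the analysis (T is a symbol introduced here for illustration):

```latex
\[
\sum_{i=0}^{T} \left( \frac{n}{s\,2^{i}} + s \right)
\;\le\; \frac{2n}{s} + s\,(T+1)
\;=\; O\!\left( \frac{n}{s} + s \log\frac{n}{s} \right),
\qquad T = \log_2 \frac{n}{s}.
\]
```

The per-server work forms a geometric series that sums to less than 2n/s, while the master's O(s) cost is paid once per iteration.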

Note also that this works not just for finding the median but also for finding the kth largest value for any value of k.

answered Sep 28 '22 21:09

theja_swarup


Tags:

algorithm

meteoritepanama
