This is a Google interview question: given 2 machines, each with 64 GB of RAM filled with 8-byte integers, sort the entire 128 GB of data. You may assume a small amount of additional RAM. Extend this to sorting data stored across 1000 machines.
I came up with external sort: divide the data into chunks, sort each chunk, write the sorted chunks back out, then read them back piece by piece and merge them. Is there a better way, and what would the complexity be?
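Concretely, the external sort I have in mind looks roughly like the sketch below (Python; the chunk size, file paths, and the flat binary file of 8-byte integers are just illustrative assumptions):

    # Sketch of the external sort idea: sort fixed-size chunks, spill each
    # sorted run to a temporary file, then k-way merge the runs.
    import heapq
    import struct
    import tempfile

    CHUNK_INTS = 1_000_000       # illustrative: how many 8-byte ints to hold in RAM at once
    INT64 = struct.Struct("<q")  # one signed 64-bit integer, little-endian

    def read_ints(f):
        # Yield 8-byte integers from an open binary file.
        while True:
            buf = f.read(INT64.size)
            if not buf:
                return
            yield INT64.unpack(buf)[0]

    def spill(chunk):
        # Sort one in-memory chunk and write it out as a temporary run file.
        chunk.sort()
        run = tempfile.TemporaryFile()
        run.writelines(INT64.pack(v) for v in chunk)
        run.seek(0)
        return run

    def external_sort(in_path, out_path):
        runs, chunk = [], []
        with open(in_path, "rb") as src:
            for value in read_ints(src):
                chunk.append(value)
                if len(chunk) == CHUNK_INTS:
                    runs.append(spill(chunk))
                    chunk = []
        if chunk:
            runs.append(spill(chunk))
        # k-way merge of the sorted runs into one sorted output file.
        with open(out_path, "wb") as dst:
            for value in heapq.merge(*(read_ints(run) for run in runs)):
                dst.write(INT64.pack(value))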
ChingPing proposes an O(n log n) sort for each subset, followed by a linear merge (by swapping elements). The problem with Quicksort (and most of the O(n log n) sorts) is that they can require O(n) extra memory. I'd recommend instead using a SmoothSort, which uses constant memory and still runs in O(n log n).
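SmoothSort itself is fairly intricate to implement; as a rough stand-in with the same O(1) extra memory and O(n log n) worst case (though without SmoothSort's adaptivity to nearly sorted input), the per-machine sort could be a plain in-place heapsort, sketched here:

    # In-place heapsort: O(n log n) worst case, O(1) extra memory. Shown only
    # as a stand-in for SmoothSort, which has the same bounds but is adaptive.
    def heapsort(a):
        n = len(a)

        def sift_down(root, end):
            # Push a[root] down until the max-heap property holds below it.
            while 2 * root + 1 < end:
                child = 2 * root + 1
                if child + 1 < end and a[child] < a[child + 1]:
                    child += 1          # pick the larger child
                if a[root] >= a[child]:
                    return
                a[root], a[child] = a[child], a[root]
                root = child

        # Build a max-heap, then repeatedly move the current maximum to the end.
        for start in range(n // 2 - 1, -1, -1):
            sift_down(start, n)
        for end in range(n - 1, 0, -1):
            a[0], a[end] = a[end], a[0]
            sift_down(0, end)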
The worst case scenario is where you have something like:
setA = [maxInt .. 1]
setB = [0..minInt]
where both sets start out in reverse order and every element of setA is larger than every element of setB, so the merge has to move everything: the two machines effectively end up swapping their entire contents.
A clearer (IMO) explanation of ChingPing's solution is:
Have two pointers, pointerA and pointerB, initialized at the beginning of each array
While pointerA is not at the end of setA and pointerB is not at the end of setB
    if (setA[pointerA] < setB[pointerB])
    then { pointerA++; }
    else { swap(setA[pointerA], setB[pointerB]); pointerB++; }
The sets should both now be sorted.
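A runnable version of that pseudocode is sketched below (Python; the function name and example arrays are mine). It implements the single swap pass exactly as described above, with the added bound on pointerB so it cannot run past the end of setB; treat it as an illustration of the described pass rather than a verified general-purpose in-place merge.

    # One swap pass over two individually sorted lists, as described above:
    # walk setA, and whenever its current element is not smaller than setB's
    # current element, swap the two and advance the setB pointer.
    def swap_merge(set_a, set_b):
        pointer_a, pointer_b = 0, 0
        while pointer_a < len(set_a) and pointer_b < len(set_b):
            if set_a[pointer_a] < set_b[pointer_b]:
                pointer_a += 1
            else:
                set_a[pointer_a], set_b[pointer_b] = set_b[pointer_b], set_a[pointer_a]
                pointer_b += 1

    # The worst case from above, where every element of setA exceeds every
    # element of setB: the pass swaps the entire contents of the two lists.
    set_a = [5, 6, 7, 8]
    set_b = [1, 2, 3, 4]
    swap_merge(set_a, set_b)
    print(set_a, set_b)   # [1, 2, 3, 4] [5, 6, 7, 8]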