 

Sorting Data larger than the RAM size

This is a Google interview question: given 2 machines, each with 64 GB of RAM full of 8-byte integers, sort the entire 128 GB of data. You may assume a small amount of additional RAM. Extend this to sorting data stored across 1000 machines.

I came up with external sort: divide the data into chunks that fit in RAM, sort each chunk, write the sorted chunks back to disk, then read them again piecewise and merge them. Is there a better way, and what would the complexity be?
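
To make that concrete, here is a minimal sketch of the two-phase external merge sort the question describes. The chunk size, the binary layout (little-endian 8-byte signed integers), and the use of Python's heapq.merge for the k-way merge phase are all illustrative assumptions, not details from the question itself.

import heapq
import os
import struct
import tempfile

CHUNK_INTS = 1_000_000   # assumed: how many 8-byte integers fit comfortably in RAM

def read_ints(path):
    # Stream 8-byte signed little-endian integers from a binary file.
    with open(path, "rb") as f:
        while True:
            buf = f.read(8)
            if not buf:
                break
            yield struct.unpack("<q", buf)[0]

def external_sort(input_path, output_path):
    # Phase 1: read RAM-sized chunks, sort each in memory, write sorted runs to temp files.
    run_paths = []
    source = read_ints(input_path)
    while True:
        chunk = [value for _, value in zip(range(CHUNK_INTS), source)]
        if not chunk:
            break
        chunk.sort()
        fd, run_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as run:
            run.write(struct.pack("<%dq" % len(chunk), *chunk))
        run_paths.append(run_path)

    # Phase 2: k-way merge of the sorted runs, streaming straight to the output file.
    with open(output_path, "wb") as out:
        for value in heapq.merge(*(read_ints(p) for p in run_paths)):
            out.write(struct.pack("<q", value))

    for p in run_paths:
        os.remove(p)

Sorting k chunks of n/k elements costs O(n log(n/k)) comparisons and the k-way merge adds O(n log k), so the total is O(n log n), with each element read and written twice on disk.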

asked Dec 21 '11 by princess of persia

1 Answer

ChingPing proposes an O(n log n) sort for each subset, followed by a linear merge (by swapping the elements). The problem with Quicksort (and most of the O(n log n) sorts) is that they require O(n) extra memory. I'd recommend instead using Smoothsort, which uses constant memory and still runs in O(n log n).
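
Smoothsort is fairly intricate to implement, so as a stand-in illustration of an in-place O(n log n) sort with O(1) extra memory, here is a plain heapsort; this is my substitution for illustration rather than the answer's suggested Smoothsort.

def heapsort(a):
    # In-place heapsort: O(n log n) time, O(1) extra memory.
    n = len(a)

    def sift_down(root, end):
        # Push a[root] down until the max-heap property holds on a[root:end].
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    # Build a max-heap, then repeatedly move the current maximum to the end.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)

Heapsort is not adaptive the way Smoothsort is (Smoothsort approaches O(n) on nearly sorted input), but it has the same worst-case bound and the constant extra memory that matters here.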

The worst case scenario is where you have something like:

setA = [maxInt .. 1]
setB = [0..minInt]

where both sets are written in reverse order, and every value in setA is larger than every value in setB, so the merge has to swap every element across.

The (IMO clearer) explanation of ChingPing's solution is:

Have two pointers, 'pointerA' and 'pointerB', initialized at the beginning of each array
While pointerA is not at the end of setA
  if (setA[pointerA] < setB[pointerB])
    then { pointerA++; }
    else { swap(setA[pointerA], setB[pointerB]); pointerB++; }

The sets should both now be sorted.
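
For concreteness, here is a direct Python transcription of that pseudocode; the function name and sample data are mine, and I have added an explicit bounds check on pointerB, which the pseudocode leaves implicit.

def swap_merge(setA, setB):
    # Walk setA, swapping any element that is >= the current setB element
    # into setB, as in the pseudocode above. Both inputs are assumed to be
    # individually sorted already.
    pointerA, pointerB = 0, 0
    while pointerA < len(setA) and pointerB < len(setB):
        if setA[pointerA] < setB[pointerB]:
            pointerA += 1
        else:
            setA[pointerA], setB[pointerB] = setB[pointerB], setA[pointerA]
            pointerB += 1

# Example: every element of setA is larger than every element of setB.
setA = [4, 5, 6]
setB = [1, 2, 3]
swap_merge(setA, setB)
print(setA, setB)  # [1, 2, 3] [4, 5, 6]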

answered Oct 03 '22 by Noxville