
Sort 1TB file on machine with 1GB RAM

This question seems easy, but I am not able to understand how it really works. I know people will say: break the file into 512 MB chunks and sort them, for example with merge sort, using MapReduce.

So here is the actual question I have:

Suppose I break the file into 512 MB chunks and send them to different host machines to be sorted, and suppose these machines use merge sort. Say I had 2000 machines that between them sorted 2000 chunks of 512 MB each. Now, when I merge them back, how does that work? Won't the size keep increasing again? For example, merging two 512 MB chunks produces 1024 MB, which is the size of my RAM, so how would this work? No machine can merge a chunk larger than 512 MB with another chunk, because then the total size exceeds 1 GB.

How, at the end of merging, will I ever be able to merge one 0.5 TB chunk with another 0.5 TB chunk? Does the concept of virtual memory come into play here?

I am here to clarify my basics, and I hope I am asking this very important question correctly. Also, who should do this merge (after sorting)? My machine, or a few of those 2000 machines?

bruceparker asked Dec 22 '11 03:12



1 Answer

This problem can be reduced to a simpler one; it is designed to force you toward a particular approach. Here it is:

  • Read the input in chunks of about 1 GB (in practice somewhat less, so that a whole chunk fits in RAM), sort each chunk, and store it as a separate sorted file.
  • You end up with about 1000 sorted files of roughly 1 GB each on the file system.
  • Now it is simply a problem of merging k sorted arrays into a new array.

    Merging k sorted arrays requires you to maintain a min-heap (priority queue) holding k elements at a time.

i.e. k = 1000 (files) in our case. (1 GB of RAM can easily hold 1000 numbers.)

Therefore, keep popping the smallest element from your priority queue, save it to disk, and push the next element from the file that the popped element came from.

You will end up with a new sorted file of size 1 TB.
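To make the two phases concrete, here is a minimal sketch, assuming the input is a text file with one integer per line and no blank lines. The function names `write_sorted_runs` and `merge_runs`, the run file naming, and the `max_lines_per_run` parameter are made up for illustration; a real 1 TB job would size the runs in bytes rather than lines.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(src_path, run_dir, max_lines_per_run):
    """Phase 1: sort the input one memory-sized chunk at a time.

    Each chunk is written to its own sorted "run" file.
    Returns the list of run file paths.
    """
    run_paths = []
    with open(src_path) as src:
        while True:
            # Only this one chunk is held in memory at any time.
            chunk = [int(line) for line in islice(src, max_lines_per_run)]
            if not chunk:
                break
            chunk.sort()
            run_path = os.path.join(run_dir, f"run_{len(run_paths):05d}.txt")
            with open(run_path, "w") as run:
                run.writelines(f"{value}\n" for value in chunk)
            run_paths.append(run_path)
    return run_paths


def merge_runs(run_paths, dst_path):
    """Phase 2: k-way merge with a min-heap holding one value per run."""
    files = [open(path) for path in run_paths]
    try:
        # Seed the heap with the first value of every run.
        heap = []
        for index, run in enumerate(files):
            line = run.readline()
            if line:
                heap.append((int(line), index))
        heapq.heapify(heap)

        with open(dst_path, "w") as dst:
            while heap:
                value, index = heap[0]  # smallest value across all runs
                dst.write(f"{value}\n")
                line = files[index].readline()
                if line:
                    # Replace the written value with the next one from the same run.
                    heapq.heapreplace(heap, (int(line), index))
                else:
                    # That run is exhausted; the heap shrinks by one.
                    heapq.heappop(heap)
    finally:
        for run in files:
            run.close()
```

Calling `write_sorted_runs(src, run_dir, n)` and then `merge_runs(runs, dst)` produces the sorted output without ever holding more than one chunk (phase 1) or k values plus I/O buffers (phase 2) in memory. Note that keeping 1000 files open at once may come close to the per-process open-file limit on some systems, in which case the merge can be done in several passes over fewer runs.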

Refer: http://www.geeksforgeeks.org/merge-k-sorted-arrays/

Update

PS: This can be done on a single machine with 1 GB of RAM, given the right data structure.

The merge can be done in less than O(N) space with a priority queue, namely in O(K) space; that is the heart of the problem.
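As a rough illustration of why O(K) space fits comfortably, the arithmetic can be sketched as follows. The helper `merge_memory_budget` is hypothetical, and the 8-byte record size is an assumption (a packed 64-bit integer; Python objects or text lines would take more). Whatever RAM the heap does not use can be split into one read buffer per sorted file plus one write buffer, so the disk is accessed in large blocks rather than one number at a time.

```python
def merge_memory_budget(ram_bytes, num_runs, record_bytes):
    """Return (heap_bytes, buffer_bytes) for a k-way merge.

    heap_bytes:   memory for one record per run in the priority queue.
    buffer_bytes: size of each I/O buffer if the remaining RAM is split
                  evenly into one read buffer per run plus one write buffer.
    """
    heap_bytes = num_runs * record_bytes
    buffer_bytes = (ram_bytes - heap_bytes) // (num_runs + 1)
    return heap_bytes, buffer_bytes


heap_bytes, buffer_bytes = merge_memory_budget(
    ram_bytes=1024**3,  # 1 GB of RAM
    num_runs=1000,      # k sorted files
    record_bytes=8,     # assumed size of one number
)
print(heap_bytes)    # 8000
print(buffer_bytes)  # 1072661
```

Under these assumptions the heap needs only about 8 KB, and each of the 1000 files can still get a buffer of about 1 MB.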

Yugal Jindle answered Sep 20 '22 15:09
