This question seems easy, but I am not able to understand the real work behind it. I know people will say: break it down into 512 MB chunks and sort them, e.g. with merge sort using MapReduce.
So here is the actual question I have:
Suppose I break the file into 512 MB chunks and send them to different host machines to sort, and suppose those machines use merge sort. Say I had 2000 machines and each sorted one of the 2000 512 MB chunks. Now when I merge them back, how does that work? Won't the size keep increasing again? For example, merging two 512 MB chunks makes 1024 MB, which is the size of my RAM, so how would this work? No machine can merge a chunk larger than 512 MB with another chunk, because then the combined size would exceed 1 GB.
At the end of the merging, how will I ever be able to merge one 0.5 TB chunk with another 0.5 TB chunk? Does the concept of virtual memory come into play here?
I am here to clarify my basics, and I hope I am asking this very important question correctly. Also, who should do this merge (after the sorting)? My machine, or a few of those 2000 machines?
This problem can be reduced to a simpler one; it was designed to force you toward a particular approach. Here it is:
Split the 1 TB file into chunks that fit in RAM, sort each chunk, and write it back to disk (exactly the part you already described). Now it is simply a problem of merging k sorted arrays into a new array.
Merging k sorted arrays requires you to maintain a min-heap (priority queue) with k elements at a time.
Here, k = 2000 (the number of chunk files) in our case; a heap of 2000 numbers fits trivially in 1 GB of RAM.
Therefore, keep popping the minimum element from your priority queue, writing it to the output file on disk, and pushing the next element from whichever chunk file the popped element came from.
You will end up with a new, fully sorted file of size 1 TB.
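A minimal sketch of that k-way merge in Python, assuming the sorted chunks sit on disk as text files with one integer per line (the names chunk_0.txt ... chunk_1999.txt and sorted_1tb.txt are hypothetical):

    import heapq

    def merge_sorted_chunks(chunk_paths, output_path):
        files = [open(p) for p in chunk_paths]
        heap = []  # never holds more than k = len(chunk_paths) entries

        # Seed the heap with the first element of every chunk.
        for i, f in enumerate(files):
            line = f.readline()
            if line:
                heapq.heappush(heap, (int(line), i))

        with open(output_path, "w") as out:
            while heap:
                # Pop the global minimum and append it to the output file.
                value, i = heapq.heappop(heap)
                out.write(f"{value}\n")
                # Refill from the chunk the popped value came from,
                # so the heap stays at k entries at most.
                line = files[i].readline()
                if line:
                    heapq.heappush(heap, (int(line), i))

        for f in files:
            f.close()

    merge_sorted_chunks([f"chunk_{i}.txt" for i in range(2000)],
                        "sorted_1tb.txt")

In practice you would read and write in large buffered blocks rather than one line at a time, and opening 2000 files at once may bump into the OS file-descriptor limit (in which case you merge in two passes), but the key point stands: the heap only ever holds one element per chunk, so the merge needs O(k) memory, not O(N).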
See: http://www.geeksforgeeks.org/merge-k-sorted-arrays/
Update
PS: This can be done on a single machine with 1 GB of RAM, given the right data structure.
The merge can be done in less than O(N) space with a priority queue, i.e. in O(k) space, and that is the heart of the problem.
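As an aside, Python's standard library already provides this lazy k-way merge, so the single-machine version is only a few lines (again assuming the same hypothetical chunk files as in the sketch above):

    import heapq

    # heapq.merge consumes the chunk iterators lazily, holding only
    # one element per chunk in memory at any moment, i.e. O(k) space.
    chunks = [map(int, open(f"chunk_{i}.txt")) for i in range(2000)]
    with open("sorted_1tb.txt", "w") as out:
        for value in heapq.merge(*chunks):
            out.write(f"{value}\n")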