This is another question from Cracking the Coding Interview; I still have a doubt after reading the solution.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm
would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2 GB, it should tell you something - in this case, it suggests that they don't want you to bring all the data into memory. So what do we do? We only bring part of the data into memory at a time. Algorithm:
1. How much memory do we have available? Let's assume we have X MB of memory available.
2. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory in turn, sort its lines as usual using any O(n log n) algorithm, and save the sorted lines back to a file; then bring the next chunk into memory and sort it.
3. Once we're done, merge the sorted chunks one by one.
The above algorithm is also known as an external sort. Step 3 is known as an N-way merge. The rationale behind using external sort is the size of the data: since the data is too huge to bring into memory all at once, we need a disk-based sorting algorithm.
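A minimal Python sketch of the chunk-sorting phase (the function name, the line-count budget standing in for X MB, and the use of temporary files are my assumptions, not part of the book's solution):

```python
import tempfile

def create_sorted_runs(path, max_lines):
    """Split 'path' into sorted temporary runs of at most max_lines lines.
    max_lines stands in for the X MB memory budget; assumes every line
    ends with a newline."""
    runs = []
    with open(path) as src:
        while True:
            # read at most max_lines lines (one chunk) into memory
            chunk = [line for _, line in zip(range(max_lines), src)]
            if not chunk:
                break
            chunk.sort()  # in-memory O(n log n) sort
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.close()
            runs.append(run.name)
    return runs
```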
Doubt:
In step 3, when doing the merge, don't we need 2*X space each time we compare two sorted chunks? And the limit was X MB. Should we instead use (X/2) * 2K = 2 GB, so that each chunk is X/2 MB and there are 2K chunks? Or am I just misunderstanding the merge step? Thanks!
The easiest way to do this is to use external sorting. We divide the source file into temporary files, each no larger than the available RAM, and sort these files first.
Merge sort time complexity analysis. Conquer part: we recursively solve two subproblems, each of size n/2, so each subproblem costs T(n/2) and the conquer part contributes 2T(n/2) in total. Combine part: as calculated above, the worst-case time complexity of the merging process is O(n). Together these give the recurrence T(n) = 2T(n/2) + O(n).
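Unrolling that recurrence (a standard derivation; assume n is a power of two for simplicity) shows where the O(n log n) comes from:

```latex
T(n) = 2\,T(n/2) + cn
     = 4\,T(n/4) + 2\,cn
     = \cdots
     = 2^{k}\,T(n/2^{k}) + k\,cn
```

With k = log2 n, the first term becomes n·T(1) = O(n) and the second becomes cn·log2 n, so T(n) = O(n log n).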
An array of size N is divided into two parts of size N/2 each; those arrays are further divided until we reach a single element. The base case here is reaching a single element. When the base case is hit, we start merging the left part and the right part, and we get a sorted array at the end.
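A minimal recursive sketch of that description in Python (the in-memory, array-based version):

```python
def merge_sort(a):
    """Sort list 'a' by splitting down to single elements and merging back."""
    if len(a) <= 1:               # base case: one element is already sorted
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])    # recursively sort the left half
    right = merge_sort(a[mid:])   # recursively sort the right half
    return merge(left, right)     # combine: O(n) merge of two sorted halves

def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])          # append whatever remains of either side
    out.extend(right[j:])
    return out
```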
String mergesort takes the opposite approach. It replaces a standard string comparison with the operation LcpCompare(A, B, k). The return value is the pair (x, l), where x ∈ {<,=,>} indicates the order, and l is the length of the longest common prefix (lcp) of strings A and B, denoted by lcp(A, B); the argument k is a prefix length already known to be common to A and B, so the comparison can skip those characters.
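A sketch of how such a comparison could work in Python (the (x, l) return convention follows the text; my reading of k as an already-known common prefix length is an assumption):

```python
def lcp_compare(a, b, k):
    """Compare strings a and b, skipping the first k characters, which
    are assumed to be a known common prefix. Returns (order, l) where
    order is '<', '=' or '>' and l = lcp(a, b)."""
    i = k
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1                                 # extend the common prefix
    if i < len(a) and i < len(b):
        order = '<' if a[i] < b[i] else '>'    # first mismatching character
    elif i == len(a) and i == len(b):
        order = '='                            # identical strings
    else:
        order = '<' if i == len(a) else '>'    # shorter string sorts first
    return (order, i)
```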
http://en.wikipedia.org/wiki/External_sorting
A quick look at Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers, but you will only hold one line from each file in memory at any given time. You compare the lines you have in memory, output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not held in memory), then read the next line from that file (in our example, file 5) into memory, and repeat until you reach the end of all the chunks.
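Here's a minimal Python sketch of that K-way merge, using the standard heapq module so that only one line per run is ever in memory (the run file names are assumed to come from the split phase):

```python
import heapq

def merge_runs(run_paths, out_path):
    """K-way merge: hold one line per open run file; repeatedly emit the
    smallest line and refill from the file it came from."""
    files = [open(p) for p in run_paths]
    heap = []
    for idx, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, idx))   # seed one line per run
    with open(out_path, "w") as out:
        while heap:
            line, idx = heapq.heappop(heap)     # smallest line across all runs
            out.write(line)
            nxt = files[idx].readline()         # refill from the same run
            if nxt:
                heapq.heappush(heap, (nxt, idx))
    for f in files:
        f.close()
```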
First off, step 3 itself is not a merge sort; the whole process is the merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B into a single group Y, then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
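A generator-based sketch of that two-way merge; it holds only the current head element of each input (the names are mine, and the None sentinel assumes the data itself contains no None values):

```python
def merge_two(xs, ys):
    """Merge two sorted iterables, holding one element from each."""
    xs, ys = iter(xs), iter(ys)
    x = next(xs, None)
    y = next(ys, None)
    while x is not None and y is not None:
        if x <= y:
            yield x
            x = next(xs, None)   # advance only the side we consumed
        else:
            yield y
            y = next(ys, None)
    while x is not None:         # drain whichever input remains
        yield x
        x = next(xs, None)
    while y is not None:
        yield y
        y = next(ys, None)
```

Merging A and B this way yields Y above, and merging that result with C yields Z.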
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
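That "check the lowest value in every list" selection can be sketched directly as a linear scan over every list's head element; for large K you would typically replace the scan with a min-heap, as in the file-based sketch earlier. A minimal in-memory version:

```python
def merge_nway(lists):
    """N-way merge by scanning the head of every list for the minimum.
    Stores one cursor (head element) per input list."""
    heads = [0] * len(lists)      # one read position per input list
    out = []
    while True:
        best = None
        for i, lst in enumerate(lists):
            if heads[i] < len(lst):
                if best is None or lst[heads[i]] < lists[best][heads[best]]:
                    best = i      # remember the list with the smallest head
        if best is None:          # every list is exhausted
            break
        out.append(lists[best][heads[best]])
        heads[best] += 1
    return out

# merge_nway([[1, 3, 5, 7, 9], [0, 2, 4, 6, 8], [2, 3, 5, 7]])
# -> [0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9]
```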
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M of memory available and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase, which means a proper N-way merge will need 20 × 100K, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where, for example, each pass merges ten lists. Ten groups of 100K is only a megabyte, so it will fit into your memory constraint, and that will result in fewer merge passes.
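A sketch of that multi-pass scheme, assuming the merge_runs helper from the earlier K-way merge sketch and a fan-in chosen to fit memory:

```python
import tempfile

def multipass_merge(run_paths, fan_in=10):
    """Repeatedly merge runs in groups of 'fan_in' until one run remains."""
    while len(run_paths) > 1:
        next_runs = []
        for i in range(0, len(run_paths), fan_in):
            group = run_paths[i:i + fan_in]      # up to fan_in runs per merge
            out = tempfile.NamedTemporaryFile("w", delete=False)
            out.close()
            merge_runs(group, out.name)          # K-way merge defined earlier
            next_runs.append(out.name)
        run_paths = next_runs                    # merged runs feed the next pass
    return run_paths[0]                          # path of the fully sorted file
```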
The merging process is much simpler than that. You'll be writing the output to a new file, but basically you only need constant memory: you read just one element from each of the two input files at a time.