Hi I saw that as an interview question and thought it was an interesting question that I am not sure about the answer.
What would be the best way ?
Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).
Suppose we have to sort a 1GB file of random integers and the available ram size is 200 Mb, how will it be done? The easiest way to do this is to use external sorting. We divide our source file into temporary files of size equal to the size of the RAM and first sort these files.
In all cases tested, burstsort is the fastest string sorting method; for the largest data set, it is almost twice as fast as the best of the previous methods—and almost four times as fast as an efficient implemen- tation of quicksort (Bentley & McIlroy 1993).
Assuming *nix:
system("sort <input_file >output_file")
"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See @psyho's answer for a good summary of an external sorting algorithm.
Put them in a database and let the database worry about it.
One way to do this is to use an external sorting algorithm:
Well, this is an interesting interview question... almost all such kind of questions are meant to test your skills and don't, fortunately, directly apply to real-life examples. This looks like one, so let's get into the puzzle
When your interviewer asks for "best", I believe he/she talks about performance only.
30GB of strings is lot of data. All compare-swap algorithms are Omega(n logn), so it will take a long time. While there are O(n) algorithms, such as counting sort, they are not in place, so you will be multiplying the 30GB and you have only 4GB of RAM (consider the swapping amount...), so I would go with quicksort
Start thinking about counting sort. You may want to first split the strings in groups (using radix sort approach), one for each letter. You may want to scan the file and, for each initial letter, move the string (so copy and delete, no space waste) into a temporary file. You may want to repeat the process for the first 2, 3 or 4 chars of each string. Then, in order to reduce the complexity of sorting lots of files, you can separately sort the string within each one (using quicksort now) and finally merge all files in order. This way you'll still have a O(n logn) but on fair lower n
Database systems are already handling this particular problem well.
A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With