Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to sort 30gb of strings with a computer with 4gb of RAM using Ruby as scripting language?

Hi I saw that as an interview question and thought it was an interesting question that I am not sure about the answer.

What would be the best way ?

like image 646
Cristiano Fontes Avatar asked Jan 17 '11 14:01

Cristiano Fontes


People also ask

Which sorting technique is best for large data?

Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).

How do I sort large files with small memory?

Suppose we have to sort a 1GB file of random integers and the available ram size is 200 Mb, how will it be done? The easiest way to do this is to use external sorting. We divide our source file into temporary files of size equal to the size of the RAM and first sort these files.

What is the fastest algorithm to sort a string?

In all cases tested, burstsort is the fastest string sorting method; for the largest data set, it is almost twice as fast as the best of the previous methods—and almost four times as fast as an efficient implemen- tation of quicksort (Bentley & McIlroy 1993).


5 Answers

Assuming *nix:

system("sort <input_file >output_file")

"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.

If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See @psyho's answer for a good summary of an external sorting algorithm.

like image 172
Wayne Conrad Avatar answered Oct 26 '22 23:10

Wayne Conrad


Put them in a database and let the database worry about it.

like image 42
Ignacio Vazquez-Abrams Avatar answered Oct 27 '22 00:10

Ignacio Vazquez-Abrams


One way to do this is to use an external sorting algorithm:

  1. Read a chunk of file into memory
  2. Sort that chunk using any regular sorting algorithm (like quicksort)
  3. Output the sorted strings into a temporary file
  4. Repeat steps 1-3 until you process the whole file
  5. Apply the merge-sort algorithm by reading the temporary files line by line
  6. Profit!
like image 38
psyho Avatar answered Oct 26 '22 22:10

psyho


Well, this is an interesting interview question... almost all such kind of questions are meant to test your skills and don't, fortunately, directly apply to real-life examples. This looks like one, so let's get into the puzzle

When your interviewer asks for "best", I believe he/she talks about performance only.

Answer 1

30GB of strings is lot of data. All compare-swap algorithms are Omega(n logn), so it will take a long time. While there are O(n) algorithms, such as counting sort, they are not in place, so you will be multiplying the 30GB and you have only 4GB of RAM (consider the swapping amount...), so I would go with quicksort

Answer 2 (partial)

Start thinking about counting sort. You may want to first split the strings in groups (using radix sort approach), one for each letter. You may want to scan the file and, for each initial letter, move the string (so copy and delete, no space waste) into a temporary file. You may want to repeat the process for the first 2, 3 or 4 chars of each string. Then, in order to reduce the complexity of sorting lots of files, you can separately sort the string within each one (using quicksort now) and finally merge all files in order. This way you'll still have a O(n logn) but on fair lower n

like image 42
usr-local-ΕΨΗΕΛΩΝ Avatar answered Oct 26 '22 23:10

usr-local-ΕΨΗΕΛΩΝ


Database systems are already handling this particular problem well.

A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.

like image 30
yfeldblum Avatar answered Oct 27 '22 00:10

yfeldblum