Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting a 20GB file with one string per line

Tags:

sorting

In question 11.5 of Gayle Laakman's book, Cracking the Technical Interview,

"Imagine you have a 20GB file with one string per line. Explain how you would sort the file"

My initial reaction was exactly the solution that she proposed - splitting the file into smaller chunks (megabytes) by reading in X mb's of data, sorting it, and then writing it to disk. And at the very end, merge the files.

I decided not to pursue this approach because the final merge would involve holding on to all the data in main memory - and we're assuming that's not possible. If that's the case, how exactly does this solution hold?

My other approach is based on the assumption that we have near unlimited disk space, or at least enough to hold 2X the data we already have. We can read in X mb's of data and then generate hash keys for them - each key corresponding to a line in a file. We'll continue doing this until all values have been hashed. Then we just have to write the values of that file into the original file.

Let me know what you think.

like image 961
Kira Avatar asked Feb 11 '13 17:02

Kira


People also ask

How would you sort words in a large file?

For sorting a very large file , we can use external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into the main memory and instead they reside in the slower external memory . It uses a hybrid sort-merge strategy.

How do I sort large files with small memory?

Suppose we have to sort a 1GB file of random integers and the available ram size is 200 Mb, how will it be done? The easiest way to do this is to use external sorting. We divide our source file into temporary files of size equal to the size of the RAM and first sort these files.

How do you sort data in a text file?

Although there's no straightforward way to sort a text file, we can achieve the same net result by doing the following: 1) Use the FileSystemObject to read the file into memory; 2) Sort the file alphabetically in memory; 3) Replace the existing contents of the file with the sorted data we have in memory.


1 Answers

http://en.wikipedia.org/wiki/External_sorting gives a more detailed explanation on how external sorting works. It addresses your concern of eventually having to bring the entire 20gB into memory by explaining how you perform the final merge of the N sorted chunks by reading in chunks of the sorted chunks as opposed to reading in all the sorted chunks at the same time.

like image 150
Sahir Contractor Avatar answered Sep 30 '22 21:09

Sahir Contractor