Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort a file with huge volume of data given memory constraint

Tags:

java

file

sorting

Points:

  • We process thousands of flat files in a day, concurrently.
  • Memory constraint is a major issue.
  • We use thread for each file process.
  • We don't sort by columns. Each line (record) in the file is treated as one column.

Can't Do:

  • We cannot use unix/linux's sort commands.
  • We cannot use any database system no matter how light they can be.

Now, we cannot just load everything in a collection and use the sort mechanism. It will eat up all the memory and the program is gonna get a heap error.

In that situation, how would you sort the records/lines in a file?

like image 924
Erika Gomez Avatar asked Jan 18 '10 16:01

Erika Gomez


People also ask

How do I sort large files with limited memory?

Solution 1: step 1: Read M from the file and write it into a temp buffer. step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values. step 3: Create a temp file and write the buffer into the temp file.

How can I sort 10gb file with 1GB RAM?

For sorting 10 GB of data using only 1 GB of RAM: Read 1 GB of the data in main memory and sort by using quicksort. Write the sorted data to disk. Repeat steps 1 and 2 until all of the data is in sorted 1GB chunks (there are 10 GB / 1 GB = 10 chunks), which now need to be merged into one single output file.

How do you sort a huge amount of data which Cannot be loaded into RAM at once?

External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead, they must reside in the slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy.


1 Answers

It looks like what you are looking for is external sorting.

Basically, you sort small chunks of data first, write it back to the disk and then iterate over those to sort all.

like image 198
phisch Avatar answered Sep 24 '22 15:09

phisch