Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I sort very large files

Tags:

java

file

sorting

I have some files that should be sorted according to id at the beginning of each line. The files are about 2-3 gb.

I tried to read all data into an ArrayList and sort them. But memory is not enough to keep them all. It does not work.

Lines look like

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

How can I sort the files??

like image 305
Kayser Avatar asked Oct 27 '11 15:10

Kayser


People also ask

How do I sort a 10 GB file?

For sorting 10 GB of data using only 1 GB of RAM: Read 1 GB of the data in main memory and sort by using quicksort. Write the sorted data to disk. Repeat steps 1 and 2 until all of the data is in sorted 1GB chunks (there are 10 GB / 1 GB = 10 chunks), which now need to be merged into one single output file.

How do I sort large files with limited memory?

Solution 1: step 1: Read M from the file and write it into a temp buffer. step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values. step 3: Create a temp file and write the buffer into the temp file.

Which sorting is best for large data?

Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).


2 Answers

That isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that isn't completely read into memory. A few adaptations to Merge-Sort can achieve this.

Take a look at this: http://en.wikipedia.org/wiki/Merge_sort

and: http://en.wikipedia.org/wiki/External_sorting

Basically the idea here is to break the file into smaller pieces, sort them (either with merge sort or another method), and then use the Merge from merge-sort to create the new, sorted file.

like image 77
pcalcao Avatar answered Oct 11 '22 12:10

pcalcao


Since your records are already in flat file text format, you can pipe them into UNIX sort(1) e.g. sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform merge sort using available memory and /tmp. If you need more space than you have memory available, add -T /tmpdir to the command.

It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge-sort yourself when you can use a tool that is available on every platform and has been around for decades.

like image 21
rjh Avatar answered Oct 11 '22 14:10

rjh