I have some files that should be sorted according to id at the beginning of each line. The files are about 2-3 gb.
I tried to read all data into an ArrayList
and sort them. But memory is not enough to keep them all. It does not work.
Lines look like
0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013
How can I sort the files??
For sorting 10 GB of data using only 1 GB of RAM: Read 1 GB of the data in main memory and sort by using quicksort. Write the sorted data to disk. Repeat steps 1 and 2 until all of the data is in sorted 1GB chunks (there are 10 GB / 1 GB = 10 chunks), which now need to be merged into one single output file.
Solution 1: step 1: Read M from the file and write it into a temp buffer. step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values. step 3: Create a temp file and write the buffer into the temp file.
Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).
That isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that isn't completely read into memory. A few adaptations to Merge-Sort can achieve this.
Take a look at this: http://en.wikipedia.org/wiki/Merge_sort
and: http://en.wikipedia.org/wiki/External_sorting
Basically the idea here is to break the file into smaller pieces, sort them (either with merge sort or another method), and then use the Merge from merge-sort to create the new, sorted file.
Since your records are already in flat file text format, you can pipe them into UNIX sort(1)
e.g. sort -n -t' ' -k1,1 < input > output
. It will automatically chunk the data and perform merge sort using available memory and /tmp
. If you need more space than you have memory available, add -T /tmpdir
to the command.
It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge-sort yourself when you can use a tool that is available on every platform and has been around for decades.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With