
Sorting the lines of an enormous file.txt in Java

I'm working with a very big text file (755 MB). I need to sort its lines (about 1,890,000) and then write them to another file.

I have already seen this discussion, whose input file is very similar to mine: Sorting Lines Based on words in them as keys

The problem is that I cannot store the lines in an in-memory collection, because I get a Java heap space error even after raising the heap size to the maximum (already tried).

I can't open it in Excel and use its sorting feature either, because the file is too large to be loaded completely.

I thought about using a database, but I think that inserting all the lines and then running a SELECT query would take too long to execute. Am I wrong?

Any hints appreciated. Thanks in advance.

Lucia Belardinelli asked Jan 12 '12 09:01



1 Answer

I think the solution here is to do a merge sort using temporary files:

  1. Read the first n lines of the original file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to a file 1.tmp (or whatever you want to call it). Do the same with the next n lines and store the result in 2.tmp. Repeat until all lines of the original file have been processed.

  2. Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.

  3. Delete all the temporary files.

This works with arbitrarily large files, as long as you have enough disk space.
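
The three steps above could be sketched in Java roughly as follows. This is only a minimal sketch under some assumptions: the class name ExternalSort, the method sort and the parameter maxLinesPerChunk (the "n" from step 1) are made up for illustration, the file is assumed to be valid UTF-8, and lines are compared in natural String order. A priority queue is used to find the smallest line among the temporary files in step 2; a plain linear scan over the current lines would work too.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper class; names and signatures are illustrative only.
class ExternalSort {

    // One open temporary file plus the line most recently read from it.
    private static final class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        String line;

        ChunkCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        // Moves to the next line; returns false once the file is exhausted.
        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }

        @Override
        public int compareTo(ChunkCursor other) {
            return line.compareTo(other.line); // assumed order: natural String order
        }
    }

    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try {
            // Step 1: read up to maxLinesPerChunk lines, sort them in memory,
            // write them to a temporary file, and repeat until the input ends.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() >= maxLinesPerChunk) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer));
                }
            }

            // Step 2: keep the current line of every temporary file in a heap,
            // repeatedly write the smallest one and refill from the same file.
            List<BufferedReader> readers = new ArrayList<>();
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                PriorityQueue<ChunkCursor> heap = new PriorityQueue<>();
                for (Path chunk : chunks) {
                    BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                    readers.add(reader);
                    ChunkCursor cursor = new ChunkCursor(reader);
                    if (cursor.line != null) {
                        heap.add(cursor);
                    }
                }
                while (!heap.isEmpty()) {
                    ChunkCursor smallest = heap.poll();
                    out.write(smallest.line);
                    out.newLine();
                    if (smallest.advance()) {
                        heap.add(smallest);
                    }
                }
            } finally {
                for (BufferedReader reader : readers) {
                    reader.close();
                }
            }
        } finally {
            // Step 3: delete all the temporary files.
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk);
            }
        }
    }

    // Sorts the buffered lines and writes them to a new temporary file.
    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".tmp");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }
}
```

It could be called as, for example, ExternalSort.sort(Paths.get("file.txt"), Paths.get("sorted.txt"), 100000). With about 1,890,000 lines, that chunk size would produce roughly 19 temporary files; the right value depends on how long the lines are and how much heap is available. All temporary files are open at once during the merge, so a very small chunk size may run into the operating system's limit on open files.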

celtschk answered Oct 29 '22 17:10